Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Branding: improve IO resilience #3182

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

kwzrd
Copy link
Contributor

@kwzrd kwzrd commented Oct 9, 2024

Closes #2808

This PR aims to improve the resiliency of the Branding manager cog. It is a rework of #2869, which I've closed and re-implemented in this branch in adherence with the reviewer's request to use the tenacity lib.

Context

The Branding manager automatically discovers events once per day or when instructed to. During each such discovery, it makes multiple HTTP requests to the GitHub API. First, it fetches a list of all events in the Branding repository, and then it fetches those events one-by-one.

Currently, if the bot fails to fetch an event, it simply skips it. This seemed useful initially, as we didn't want the bot to stop the discovery iteration if just one event is badly configured. In practice, however, this causes undesired behaviour: if the currently active event is skipped due to a 5xx error, the bot may reset the guild's branding to the fallback event, despite the event still being in progress.

This PR makes adjustments to the error handling methodology. Now, if any of the requests fail, the whole discovery iteration is aborted & retried the following day (or manually via the sync command). The last valid state remains in the Redis cache, so the cog will continue to work just fine.

Additionally, 5xx errors are now retried up to 5 times, with exponential backoff. This prevents discovery iterations from being cancelled due to stray server errors.

Implementation

The solution is fairly simple. The repository class no longer swallows IO/deserialization errors, and instead lets them propagate to the cog. The cog mostly already handles exceptions coming from the repo, I just added error handling where it was missing.

The calendar refresh can now fail (well, it could always fail, but the error was hidden). I added an embed response to inform the user of the result:

image
image

Because of this, it no longer automatically invokes the calendar view after the refresh - it seemed noisy. But feel free to propose other solutions.

I've used the Apache 2.0 licensed tenacity library for the retry with backoff.

If we fail to fetch an event, the whole branding sync will
now be aborted. This will prevent situations where we fail
to fetch the current event due to a 5xx error and the cog
resorts to the fallback branding in the middle of an event.

Error handling is moved to the cog. The repo abstraction
will now propagate errors rather than silence them.
Use the tenacity lib to retry 5xx responses from GitHub.
We now send a result embed after refresh. It would be noisy
to also send the calendar embed.

Users can invoke the calendar manually if desired.
@kwzrd
Copy link
Contributor Author

kwzrd commented Oct 9, 2024

For testing purposes, you can do this. :trollface:

async with self.bot.http_session.get(...) as response:
    response.status = 500
    _raise_for_status(response)

The cog is overall not very testable. I have some ideas on how to improve it, but I'd rather leave it for a separate PR.

Copy link
Member

@ChrisLovering ChrisLovering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👌 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Branding cog - RuntimeError: Failed to fetch file due to status: 503
2 participants