Motivation
Following the work in #4052, api.couchers.org/status is live, but it would be even more useful to the team if we were notified of downtime and of when the API and database come back up.
Implementation
We can run a containerized Python script alongside our other network tools that periodically queries this URL to check database connectivity. The script itself must not be able to fail, so that a bug in the monitor is never reported as downtime.
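One way to keep the monitor alive is to wrap every check in a catch-all handler inside the polling loop. The following is only a minimal sketch under assumptions: `run_forever`, its parameters, and the 60-second default interval are hypothetical names and values chosen for illustration, not part of any existing script.

```python
import logging
import time

logger = logging.getLogger(__name__)


def run_forever(check, interval_seconds=60, sleep=time.sleep, max_iterations=None):
    """Call check() repeatedly, logging any exception instead of letting it escape.

    max_iterations exists only so the loop can be exercised in tests;
    leave it as None to poll indefinitely.
    Returns the number of iterations in which check() raised.
    """
    failures = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        try:
            check()
        except Exception:
            # A bug in the check must not take the monitor down with it.
            failures += 1
            logger.exception("status check raised; continuing")
        sleep(interval_seconds)
        iterations += 1
    return failures
```

In a container this would be called once at startup with the check function as its argument. A container restart policy would still be needed for failures that Python cannot catch, such as the process being killed.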
Inspiration
@aapeliv currently uses the following notification script to detect infrastructure issues, which we can adapt and improve upon:
    import logging
    from datetime import datetime, timedelta, timezone
    from traceback import format_exception

    import requests

    logger = logging.getLogger(__name__)

    FAIL_NOTIFICATION_INTERVAL = timedelta(minutes=5)
    RATE_LIMIT_PERIOD = timedelta(minutes=120)
    RATE_LIMIT = 12


    def now():
        return datetime.now(timezone.utc)


    def now_stamp():
        return now().strftime("%y-%m-%d %H:%M:%S UTC")


    def gen_alert():
        try:
            r = requests.get("https://api.couchers.org/status", timeout=5)
            j = r.json()
            if not int(j["coucherCount"]) > 10_000:
                raise Exception("coucher_count does not exceed 10k")
            logger.info(f"Couchers API seems OK")
        except Exception as e:
            traceback = "".join(format_exception(type(e), e, e.__traceback__))
            return f"Couchers API seems down as of {now_stamp()}, traceback:\n\n{traceback}"


    notifications = []


    def should_rate_limit_notifs():
        global notifications
        cutoff = now() - RATE_LIMIT_PERIOD
        notifications = [n for n in notifications if n >= cutoff]
        # whether to rate limit, whether to send rate limit notif
        return len(notifications) >= RATE_LIMIT, len(notifications) == RATE_LIMIT


    def alert(message):
        global notifications

        def _send(msg):
            # TODO: send notification
            pass

        should_limit, should_inform = should_rate_limit_notifs()
        if not should_limit or should_inform:
            _send(message)
        if should_inform:
            _send("Rate limiting notifications.")


    fail_start = None
    last_notify = None


    def run():
        global last_notify, fail_start
        alert_msg = gen_alert()
        if alert_msg:
            if not fail_start:
                fail_start = now()
            logger.error(alert_msg)
            if not last_notify or now() - last_notify > FAIL_NOTIFICATION_INTERVAL:
                alert(alert_msg)
                last_notify = now()
            else:
                logger.info(f"Last sent message at {last_notify}, so not re-sending yet")
        else:
            if fail_start:
                alert("Couchers API back up")
            last_notify = None
            fail_start = None
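As quoted above, nothing ever appends to `notifications`, so the rate limit would not trigger unless the elided `_send` records each send. One possible improvement when adapting the script is to keep the send timestamps inside a small helper. This is a sketch under assumptions: `NotificationRateLimiter` and `allow` are hypothetical names, and the behaviour is a slight variation on the original (the rate-limit notice goes out together with the last allowed message rather than after it).

```python
from datetime import datetime, timedelta, timezone


class NotificationRateLimiter:
    """Sliding-window limiter: at most `limit` notifications per `period`."""

    def __init__(self, limit=12, period=timedelta(minutes=120)):
        self.limit = limit
        self.period = period
        self.sent = []  # timestamps of notifications allowed so far

    def allow(self, at=None):
        """Return (send_message, send_rate_limit_notice) and record the send."""
        if at is None:
            at = datetime.now(timezone.utc)
        # Forget sends that have fallen out of the window.
        cutoff = at - self.period
        self.sent = [t for t in self.sent if t >= cutoff]
        if len(self.sent) >= self.limit:
            return False, False
        self.sent.append(at)
        # The notice accompanies the message that uses up the budget.
        return True, len(self.sent) == self.limit
```

Passing the timestamp in explicitly keeps the helper testable without patching the clock; in the script, `alert` could call `allow()` with no arguments and send the message and the notice according to the two returned flags.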