Environment stopped unexpectedly #672
Comments
The previous occurrence was caused by an ongoing investigation for #612. Hence, there is nothing to investigate right now.
Sentry Issue: POSEIDON-78
The issue reappeared during the past night on production. During this night, we had severe issues with the cooling in our data center and many servers throttled. For the hardware-related monitors (such as max CPU temperature), we see an impact from 1:30am CEST to 7:00am CEST on September 10th. Despite this hardware issue, we should still investigate whether the log entries we see are fine and also double-check how the systems recovered (due to our pre-warming alert?).
I investigated on the …
The past night, we saw this issue again on production with about 30 events (= multiple events per execution environment). For two of the affected environments (28, 22), we also saw a prewarming pool alert in POSEIDON-4V. I would assume that the behavior we observed during that night (September 20th, around 3:53am - 4:23am CEST) is not fine and should be investigated?
When Docker restarts, all containers are stopped, Nomad restarts (PartOf dependency), and Poseidon receives the notification about an (unexpected) stop of an environment. However, as we can see in the Nomad logs, the job was successfully migrated. I would consider this behavior acceptable. However, we might consider spending effort on filtering out such warnings for migrations (if possible).

Nomad Logs
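For illustration, the PartOf coupling mentioned above is a systemd mechanism. The actual unit files of this deployment are not shown in this thread, so the following drop-in is only a sketch of how such a coupling is typically expressed; the path and exact directives are assumptions:

```ini
# /etc/systemd/system/nomad.service.d/docker.conf (hypothetical path)
# PartOf= propagates stop/restart of docker.service to nomad.service,
# and After= orders Nomad's start behind Docker's. Together they produce
# the chain described above: a Docker restart stops all containers and
# also restarts the Nomad agent, which Poseidon then observes as an
# unexpected environment stop.
[Unit]
After=docker.service
PartOf=docker.service
```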
Thanks for your explanation, both make sense. Besides the issue on the 20th, I also saw errors being reported for the night of the 21st, which match the list of packages that were updated (especially …). For most of the occurrences, we can cluster the events into the number of Nomad agents we have, further indicating that these events are related to the unattended-upgrades. Hence, in general, I agree that these seem to represent some "acceptable" behavior, but I still imagine the setup could be more "robust".

From my point of view and the events we observe, it looks like the migration of running allocations is not that reliable or not working as expected: Can we change the dependency of Docker and Nomad, so that Nomad is gracefully stopped first (with the …)?
We might do so by specifying the …
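The exact option is cut off above. As one possible interpretation, a systemd drop-in could drain the local Nomad client before the unit is stopped, so that allocations are migrated instead of being killed together with Docker. This is only a sketch under assumptions (unit name nomad.service, nomad CLI at /usr/bin/nomad, a 2-minute deadline), not the configuration actually used here:

```ini
# /etc/systemd/system/nomad.service.d/drain-on-stop.conf (hypothetical)
# Before systemd terminates the Nomad client, enable drain mode on the
# local node so Nomad migrates running allocations to other agents.
# Note: this would also drain on every manual restart, which may or may
# not be desirable.
[Service]
ExecStop=/usr/bin/nomad node drain -self -enable -deadline 2m -yes
```

Whether an explicit drain on stop is preferable to adjusting the Before=/After= ordering between docker.service and nomad.service, as asked above, would still need to be evaluated.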
Mh, I see. My main objective is still to improve the reliability of unattended upgrades and the restarts/reschedules/migrations that come with them. The background is that I regularly receive messages about potential failures: either through Sentry or through Icinga (reported to me in Slack), checking both CodeOcean and Poseidon. For each of these, I need to check whether a real issue occurred that potentially needs intervention or whether we observed the same reoccurring issues (like this one or the prewarming pool alert). Hence, I would like to get more relevant notifications.

"Just" touching the notifications themselves (like muting them, changing their log level, etc.) somewhat feels weird and, if possible, I would like to address the root cause. I do understand that the "Environment stopped unexpectedly" issue is "expected" as soon as something happens with the environment (like a restart/reschedule/migration). Do you see any option to improve Poseidon's reaction to these events?

Similarly, but this might be a topic for another issue, I would welcome a change that doesn't trigger the prewarming pool alert during nightly unattended-upgrades. Ideally, by ensuring we have enough runners available in the pool, not just by silencing the alerts. Hence, I thought of the …
Yeah, let's address this with #693
That definitely should not happen! Especially as we have the randomized unattended-upgrades time. In the context of #587, we should look at the individual events and analyze whether the lost runners are caused by the bug submitted with hashicorp/nomad#23937 or whether they have different causes.
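For reference, the randomized unattended-upgrades time mentioned above typically comes from the systemd timers driving apt. The following is only a sketch of how such a random nightly window is commonly configured; the concrete schedule and offset on these servers are assumptions:

```ini
# /etc/systemd/system/apt-daily-upgrade.timer.d/override.conf (hypothetical)
# Reset the stock schedule and run the nightly upgrade job at 03:00 plus
# a random delay of up to two hours, so the agents do not all restart
# Docker/Nomad at the same moment.
[Timer]
OnCalendar=
OnCalendar=*-*-* 03:00
RandomizedDelaySec=2h
```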
Nice, thank you!
Okay, you're right, of course. We should not simply mute it, but I still feel my intention is valid: Reducing the notification "noise" I get. Unfortunately, I am just running out of good ideas. It feels like we would need some response to our upstream issue before we can continue with further investigation, because this would allow us to focus more on other causes (if any).
Recently, we were discussing two different problems in this issue: …

Hence, we are closing this issue as being done.
Sentry Issue: POSEIDON-73
The issue occurred on our staging environment, but with the environment_id = 6 (which is the very same image openhpi/co_execenv_python:3.4 we also use in production). Let's double check what's going on here, whether the event was expected, and whether we have all data to identify a root cause.