Environment stopped unexpectedly #672

Closed
sentry-io bot opened this issue Sep 3, 2024 · 12 comments
Labels
bug Something isn't working

Comments

sentry-io bot commented Sep 3, 2024

Sentry Issue: POSEIDON-73

Environment stopped unexpectedly

The issue occurred on our staging environment, but with environment_id = 6 (which is the very same image openhpi/co_execenv_python:3.4 we also use in production). Let's double-check what's going on here, whether the event was expected, and whether we have all the data needed to identify a root cause.

sentry-io bot added the bug label Sep 3, 2024
MrSerth (Member) commented Sep 4, 2024

The previous occurrence was caused by an ongoing investigation for #612. Hence, there is nothing to investigate right now.

MrSerth closed this as completed Sep 4, 2024
sentry-io bot (Author) commented Sep 10, 2024

Sentry Issue: POSEIDON-78

MrSerth (Member) commented Sep 10, 2024

The issue reappeared on production during the past night. That night, we had severe issues with the cooling in our data center, and many servers throttled. For the hardware-related monitors (such as max CPU temperature), we see an impact from 1:30am CEST to 7:00am CEST on September 10th.

Despite this hardware issue, we should still investigate whether the log entries we see are fine and also double-check how the systems recovered (due to our pre-warming alert?).

MrSerth reopened this Sep 10, 2024
mpass99 (Contributor) commented Sep 11, 2024

I investigated template-10.
Today at 02:39 AM UTC, Nomad Agent 3 was restarted. After the restart, the agent drained its jobs, and template-10 was migrated to Nomad Agent 1. No Prewarming Alert was fired that night. It seems fine.
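
For reference, this kind of event can be traced with standard Nomad and systemd tooling; a rough sketch (the time window and job name are taken from the observation above, the allocation ID is a placeholder):

sudo journalctl -u nomad --since "2024-09-11 02:30" --until "2024-09-11 03:30"   # agent restart and drain messages
sudo nomad job status -namespace poseidon template-10                            # current and past allocations of the job
sudo nomad alloc status <alloc-id>                                               # placement details of a single allocation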

MrSerth (Member) commented Sep 20, 2024

Last night, we saw this issue again on production, with about 30 events (i.e., multiple events per execution environment). For two of the affected environments (28, 22), we also saw a prewarming pool alert in POSEIDON-4V.

I would assume that the behavior we observed last night (September 20th, around 3:53am to 4:23am CEST) is not fine and should be investigated?

mpass99 (Contributor) commented Sep 22, 2024

The unattended-upgrades-dpkg.log shows that Docker needed to be updated on the 20th. On the 21st, AppArmor, systemd-hwe-hwdb, procps, docker-ce, and others were updated.

When Docker restarts, all containers are stopped, Nomad restarts (PartOf= dependency), and Poseidon receives a notification about an (unexpected) stop of an environment. However, as we can see in the Nomad logs below, the job was successfully migrated.

I would consider this behavior acceptable. However, we might consider spending some effort to filter out such warnings for migrations (if possible).

Nomad Logs

sudo nomad job status -namespace poseidon template-28
ID            = template-28
Name          = template-28
Submit Date   = 2024-09-05T12:37:21Z
Type          = batch
Priority      = 50
Datacenters   = dc1
Namespace     = poseidon
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost  Unknown
config         0       0         0        0       0         0     0
default-group  0       0         1        0       7         0     0

Allocations
ID        Node ID   Task Group     Version  Desired  Status    Created    Modified
02274da0  cb04341c  default-group  0        run      running   1d5h ago   1d5h ago
f140584c  15fa2016  default-group  0        stop     complete  2d6h ago   1d5h ago
4620a698  919ea902  default-group  0        stop     complete  4d6h ago   2d6h ago
7f2f3480  47501b8a  default-group  0        stop     complete  4d6h ago   4d6h ago
6beb5ac6  15fa2016  default-group  0        stop     complete  11d5h ago  4d6h ago
5f207cee  47501b8a  default-group  0        stop     complete  11d7h ago  11d5h ago
d52933fe  919ea902  default-group  0        stop     complete  12d6h ago  11d7h ago
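
To double-check both parts of this explanation, the dpkg log entry and the PartOf= coupling, something along the following lines should work (a sketch; the log path is the Ubuntu default, and the unit is assumed to be named nomad.service):

grep -i docker /var/log/unattended-upgrades/unattended-upgrades-dpkg.log   # which packages were upgraded and when
systemctl show nomad.service --property=PartOf,After,Requires              # how Nomad is coupled to docker.service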

MrSerth (Member) commented Sep 22, 2024

Thanks for your explanation; both points make sense. Besides the issue on the 20th, I also saw errors reported for the night of the 21st, which match the list of packages that were updated (especially docker-ce).

For most of the occurrences, the events cluster into as many groups as we have Nomad agents, which further indicates that these events are related to the unattended-upgrades. Hence, in general, I agree that these seem to represent some "acceptable" behavior, but I still imagine the setup could be more "robust".

From my point of view and the events we observe, it looks like the migration of running allocations is not as reliable as expected: Can we change the dependency between Docker and Nomad so that Nomad is gracefully stopped first (with drain_on_shutdown, etc.), and Docker is only restarted once Nomad is fully stopped and no allocation is running anymore? While the PartOf= relationship restarts Nomad, it does so only after Docker has been stopped (which might be too late).
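
For illustration, a minimal sketch of what such a dependency change could look like, assuming Nomad runs as nomad.service and Docker as docker.service (systemd stops units in the reverse of their After= start order, so this ordering would stop Nomad before Docker whenever Docker is restarted):

sudo systemctl edit nomad.service
# then add a drop-in along the lines of:
#   [Unit]
#   PartOf=docker.service
#   After=docker.service

Whether this actually gives Nomad enough time to drain its allocations before Docker goes down is exactly the open question discussed in the next comment.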

mpass99 (Contributor) commented Sep 23, 2024

We might do so by specifying the After dependency [1]. However, that measure might not address this issue:

  1. As far as I know, we are not observing unreliable migrations in this issue.
     • This might be an indicator that Nomad handles stopped Docker containers better than drained Nomad agents. With this measure, we might even worsen the situation.
  2. Due to Nomad's graceful shutdown behavior (Nomad Agent Restart Processes #677), the systemd shutdown does not wait for the draining (the relevant drain_on_shutdown option is sketched below). This might be a Nomad bug, but it renders this systemd-based measure ineffective.
  3. Even if we manage to reduce a Docker restart to a plain Nomad agent restart, we would still receive the same Sentry warnings. This is due to Poseidon's handling: when an allocation of an environment job stops and the environment has not been removed via the Poseidon API, we receive the warning, regardless of whether the allocation is just being restarted, rescheduled, or migrated.
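
For reference, the draining mentioned in point 2 refers to Nomad's drain_on_shutdown client option; a rough sketch of how it could be configured (the file path is an assumption, the values are examples):

cat <<'EOF' | sudo tee /etc/nomad.d/drain-on-shutdown.hcl
client {
  drain_on_shutdown {
    deadline           = "5m"   # how long the agent waits for its allocations to migrate on shutdown
    ignore_system_jobs = true   # leave system jobs running while draining
  }
}
EOF
sudo systemctl restart nomad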

MrSerth (Member) commented Sep 23, 2024

However, that measure might not address this issue

Mh, I see. My main objective is still to improve the reliability of the unattended upgrades and of the restarts/reschedules/migrations that come with them.

The background is that I regularly receive messages about potential failures, either through Sentry or through Icinga (reported to me in Slack), both of which check CodeOcean and Poseidon. For each of these, I need to check whether a real issue occurred that potentially needs intervention or whether we observed one of the known recurring issues (like this one or the prewarming pool alert). Hence, I would like to receive more relevant notifications. "Just" touching the notifications themselves (muting them, changing their log level, etc.) feels somewhat wrong, and, if possible, I would rather address the root cause.

I do understand that the "Environment stopped unexpectedly" issue is "expected" as soon as something happens with the environment (like a restart/reschedule/migration). Do you see any option to improve Poseidon's reaction to these events?

Similarly, though this might be a topic for another issue, I would welcome a change that doesn't trigger the prewarming pool alert during the nightly unattended-upgrades; ideally by ensuring we have enough runners available in the pool, not just by silencing the alerts. Hence, I thought of the drain_on_shutdown option or any other mechanism that offloads the workload of the affected agent first. Maybe we need to listen to eligibility events originating from Nomad and invoke a reschedule/migration ourselves? I really don't know...
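
One hypothetical direction for the unattended-upgrades case would be a maintenance hook that drains the affected agent before the upgrade and lifts the drain afterwards, roughly along these lines (a sketch; deadlines and ordering would need tuning):

sudo nomad node drain -self -enable -deadline 10m -yes   # migrate allocations away before the upgrade
# run the upgrade / restart Docker here
sudo nomad node drain -self -disable -yes                # afterwards, make the node schedulable again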

mpass99 (Contributor) commented Sep 24, 2024

Do you see any option to improve Poseidon's reaction to these events?

Yeah, let's address this with #693

Don't trigger the prewarming pool alert during nightly unattended-upgrades

That definitely should not happen, especially since we have randomized the unattended-upgrades time. In the context of #587, we should look at the individual events and analyze whether the lost runners are caused by the bug reported in hashicorp/nomad#23937 or whether they have different causes.

MrSerth (Member) commented Sep 24, 2024

Do you see any option to improve Poseidon's reaction to these events?

Yeah, let's address this with #693

Nice, thank you!

Don't trigger the prewarming pool alert during nightly unattended-upgrades

That definitely should not happen, especially since we have randomized the unattended-upgrades time. In the context of #587, we should look at the individual events and analyze whether the lost runners are caused by the bug reported in hashicorp/nomad#23937 or whether they have different causes.

Okay, you're right, of course. We should not simply mute it, but I still feel my intention is valid: reducing the notification "noise" I get. Unfortunately, I am running out of good ideas. It feels like we need a response to our upstream issue before we can continue with further investigation, because that would allow us to focus more on other causes (if any).

MrSerth (Member) commented Sep 25, 2024

Recently, we were discussing two different problems in this issue:

  1. First, we talked about the actual "Environment stopped unexpectedly" warning. The false positives we have seen recently were fixed with Restrict false positive warning #693. Hence, this problem is considered done unless it occurs again.
  2. Second, we talked about the occurrence of the prewarming pool alert. However, this has a dedicated issue in Prewarming Pool Alert #587, which is the more suitable place to discuss potential improvements in the future.

Hence, we are closing this issue as done.

MrSerth closed this as completed Sep 25, 2024