Bug Description
We've had reports that runners deployed by github-runner-operator occasionally stop receiving jobs.
I did some investigation and discovered that when the oom-reaper is invoked and kills the VMs, the service does not recover.
I was able to get the runners to recover by rebooting the system; restarting the charm and LXD did not fix the issue.
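A quick way to observe the symptom is to ask GitHub which self-hosted runners are registered and whether they show as online or busy. The sketch below is illustrative rather than part of the charm; it assumes organisation-level runners and GITHUB_ORG / GITHUB_TOKEN environment variables, none of which come from this report.

```python
# check_runners.py -- sketch: list the status of self-hosted runners via the GitHub API.
# GITHUB_ORG and GITHUB_TOKEN are assumed placeholders; neither is named in this report.
import os

import requests

org = os.environ["GITHUB_ORG"]      # hypothetical organisation the runners register with
token = os.environ["GITHUB_TOKEN"]  # token allowed to list the org's self-hosted runners

resp = requests.get(
    f"https://api.github.com/orgs/{org}/actions/runners",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()

# Pagination is omitted for brevity; fine for a handful of runners.
for runner in resp.json()["runners"]:
    state = "busy" if runner["busy"] else "idle"
    print(f"{runner['name']}: {runner['status']} ({state})")
```

In this incident the affected runners only showed up as online again after the host was rebooted.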
To Reproduce
N/A; from my investigation it appears this is caused by the runners running out of memory and being killed by the oom-killer.
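There is no scripted reproduction, but if someone wants to try to provoke the suspected failure mode, the sketch below allocates memory inside a runner VM until the host comes under pressure. It assumes you can get a shell inside the VM (for example with lxc exec); the file name, chunk size, and interval are arbitrary choices, not anything from this report.

```python
# memory_hog.py -- sketch: create memory pressure inside a runner VM.
# Hypothetical repro helper, e.g. run with `lxc exec <runner-vm> -- python3 memory_hog.py`;
# the VM name, chunk size, and sleep interval are arbitrary.
import time

CHUNK_MB = 64
chunks = []

while True:
    # Fill each chunk so the pages are actually resident rather than merely reserved.
    chunks.append(bytearray(b"\xff") * (CHUNK_MB * 1024 * 1024))
    print(f"allocated {len(chunks) * CHUNK_MB} MB", flush=True)
    time.sleep(0.1)
```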
Environment
latest/edge
Relevant log output
See the oom-killer messages quoted under Additional context below.
Additional context
From my investigation notes:
This turned out to be a memory issue.
oom-killer killed a bunch of the VMs on xlarge/2*. xlarge/2 was stuck in the following state:

[Thu Feb 2 15:07:36 2023] Out of memory: Killed process 3018113 (qemu-system-x86) total-vm:17906732kB, anon-rss:44264kB, file-rss:0kB, shmem-rss:6236532kB,
[Thu Feb 2 18:36:52 2023] Out of memory: Killed process 3016510 (qemu-system-x86) total-vm:17915952kB, anon-rss:43820kB, file-rss:0kB, shmem-rss:5447228kB,
[Thu Feb 2 22:25:03 2023] Out of memory: Killed process 548735 (qemu-system-x86) total-vm:18131036kB, anon-rss:45532kB, file-rss:0kB, shmem-rss:4885312kB,
[Thu Feb 2 22:41:01 2023] Out of memory: Killed process 547841 (qemu-system-x86) total-vm:18166784kB, anon-rss:48156kB, file-rss:0kB, shmem-rss:4816036kB,
[Thu Feb 2 23:05:39 2023] Out of memory: Killed process 2953011 (qemu-system-x86) total-vm:18049308kB, anon-rss:43928kB, file-rss:0kB, shmem-rss:2984476kB,
[Thu Feb 2 23:59:43 2023] Out of memory: Killed process 2951372 (qemu-system-x86) total-vm:18185036kB, anon-rss:44996kB, file-rss:0kB, shmem-rss:2964956kB,
03 Feb 2023 00:07:41Z workload maintenance Reconciling runners

Bizarre how it affected all the units.
I rebooted xlarge/2 first, then noticed there was an active runner and a test got scheduled. None of the others recovered; after I rebooted those, they also started registering as online.
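For anyone triaging a similar report, the sketch below shows one way to pull oom-killer victims out of the host's kernel log; it simply matches the same "Out of memory: Killed process" lines quoted above and is illustrative rather than part of the charm.

```python
# oom_victims.py -- sketch: list oom-killer victims from the host's kernel log.
# Matches the "Out of memory: Killed process ..." lines shown above; run with enough
# privilege to read the kernel ring buffer via dmesg.
import re
import subprocess

OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?total-vm:(?P<total_vm>\d+)kB"
)

def oom_victims():
    dmesg = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    for line in dmesg.splitlines():
        match = OOM_RE.search(line)
        if match:
            yield match.group("pid"), match.group("comm"), int(match.group("total_vm"))

if __name__ == "__main__":
    for pid, comm, total_vm in oom_victims():
        marker = "  <-- likely a runner VM" if comm.startswith("qemu-system") else ""
        print(f"pid={pid} comm={comm} total-vm={total_vm}kB{marker}")
```

Run it on the LXD host rather than inside a VM: the qemu-system-x86 processes that were killed run on the host, so that is where the messages appear.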