Bug Description
We've had reports that runners deployed by github-runner-operator occasionally stop receiving jobs.
I did some investigation and discovered that when the oom-reaper is invoked and kills the VMs, the service does not recover.
I was able to get the runners to recover by rebooting the system; restarting the charm and LXD did not fix the issue.
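A quick way to observe the symptom is to ask GitHub which self-hosted runners are registered and whether they show as online or busy. The sketch below is illustrative rather than part of the charm; it assumes organisation-level runners and GITHUB_ORG / GITHUB_TOKEN environment variables, none of which come from this report.

```python
# check_runners.py -- sketch: list the status of self-hosted runners via the GitHub API.
# GITHUB_ORG and GITHUB_TOKEN are assumed placeholders; neither is named in this report.
import os

import requests

org = os.environ["GITHUB_ORG"]      # hypothetical organisation the runners register with
token = os.environ["GITHUB_TOKEN"]  # token allowed to list the org's self-hosted runners

resp = requests.get(
    f"https://api.github.com/orgs/{org}/actions/runners",
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()

# Pagination is omitted for brevity; fine for a handful of runners.
for runner in resp.json()["runners"]:
    state = "busy" if runner["busy"] else "idle"
    print(f"{runner['name']}: {runner['status']} ({state})")
```

In this incident the affected runners only showed up as online again after the host was rebooted.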
To Reproduce
N/A; from my investigation it appears this is caused by the runners running out of memory and being killed by the oom-killer.
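There is no scripted reproduction, but if someone wants to try to provoke the suspected failure mode, the sketch below allocates memory inside a runner VM until the host comes under pressure. It assumes you can get a shell inside the VM (for example with lxc exec); the file name, chunk size, and interval are arbitrary choices, not anything from this report.

```python
# memory_hog.py -- sketch: create memory pressure inside a runner VM.
# Hypothetical repro helper, e.g. run with `lxc exec <runner-vm> -- python3 memory_hog.py`;
# the VM name, chunk size, and sleep interval are arbitrary.
import time

CHUNK_MB = 64
chunks = []

while True:
    # Fill each chunk so the pages are actually resident rather than merely reserved.
    chunks.append(bytearray(b"\xff") * (CHUNK_MB * 1024 * 1024))
    print(f"allocated {len(chunks) * CHUNK_MB} MB", flush=True)
    time.sleep(0.1)
```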
Environment
latest/edge
Relevant log output
See the oom-killer messages quoted under Additional context below.
Additional context
From my investigation notes:
This turned out to be a memory issue.
oom-killer killed a bunch of the VMs on xlarge/2*. xlarge/2 was stuck in the following state:

[Thu Feb 2 15:07:36 2023] Out of memory: Killed process 3018113 (qemu-system-x86) total-vm:17906732kB, anon-rss:44264kB, file-rss:0kB, shmem-rss:6236532kB,
[Thu Feb 2 18:36:52 2023] Out of memory: Killed process 3016510 (qemu-system-x86) total-vm:17915952kB, anon-rss:43820kB, file-rss:0kB, shmem-rss:5447228kB,
[Thu Feb 2 22:25:03 2023] Out of memory: Killed process 548735 (qemu-system-x86) total-vm:18131036kB, anon-rss:45532kB, file-rss:0kB, shmem-rss:4885312kB,
[Thu Feb 2 22:41:01 2023] Out of memory: Killed process 547841 (qemu-system-x86) total-vm:18166784kB, anon-rss:48156kB, file-rss:0kB, shmem-rss:4816036kB,
[Thu Feb 2 23:05:39 2023] Out of memory: Killed process 2953011 (qemu-system-x86) total-vm:18049308kB, anon-rss:43928kB, file-rss:0kB, shmem-rss:2984476kB,
[Thu Feb 2 23:59:43 2023] Out of memory: Killed process 2951372 (qemu-system-x86) total-vm:18185036kB, anon-rss:44996kB, file-rss:0kB, shmem-rss:2964956kB,
03 Feb 2023 00:07:41Z workload maintenance Reconciling runners

Bizarre how it affected all the units.
I rebooted xlarge/2 first, then noticed there was an active runner and a test got scheduled. None of the others recovered; after I rebooted those, they also started registering as online.
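For anyone triaging a similar report, the sketch below shows one way to pull oom-killer victims out of the host's kernel log; it simply matches the same "Out of memory: Killed process" lines quoted above and is illustrative rather than part of the charm.

```python
# oom_victims.py -- sketch: list oom-killer victims from the host's kernel log.
# Matches the "Out of memory: Killed process ..." lines shown above; run with enough
# privilege to read the kernel ring buffer via dmesg.
import re
import subprocess

OOM_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?total-vm:(?P<total_vm>\d+)kB"
)

def oom_victims():
    dmesg = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    ).stdout
    for line in dmesg.splitlines():
        match = OOM_RE.search(line)
        if match:
            yield match.group("pid"), match.group("comm"), int(match.group("total_vm"))

if __name__ == "__main__":
    for pid, comm, total_vm in oom_victims():
        marker = "  <-- likely a runner VM" if comm.startswith("qemu-system") else ""
        print(f"pid={pid} comm={comm} total-vm={total_vm}kB{marker}")
```

Run it on the LXD host rather than inside a VM: the qemu-system-x86 processes that were killed run on the host, so that is where the messages appear.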