kola: Add --max-machines #1161

cgwalters · 2020-01-14T14:57:11Z

This is only implemented for qemu at the moment, though it'd
be a mostly mechanical change to propagate it to the other
providers.

For our pipeline testing, we need to have a hard cap on the number
of qemu instances we spawn, otherwise we can go over the RAM
allocated to the pod.

Actually the FCOS pipeline today doesn't impose a hard cap, and
my test pipeline in the coreosci (nested GCP virt) ended up bringing
down the node via the OOM killer.

There were a few bugs here; first we were leaking the spawned
qemu instance. We also need to invoke Wait() synchronously in
destruction.
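
To illustrate the leak fix described above, here is a minimal sketch of killing a child process and then calling `Wait()` synchronously so that it is reaped before destruction returns. The `instance` type and the `start`/`destroy` helpers are hypothetical names for illustration, not the actual mantle code; only the `os/exec` calls are real.

```go
package main

import (
	"fmt"
	"os/exec"
)

// instance is a hypothetical stand-in for a spawned qemu machine.
type instance struct {
	cmd *exec.Cmd
}

// start launches the child process without waiting for it to finish.
func start(name string, args ...string) (*instance, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, fmt.Errorf("starting %s: %w", name, err)
	}
	return &instance{cmd: cmd}, nil
}

// destroy kills the child and then blocks in Wait() until it has been
// reaped, so no process outlives the call.
func (i *instance) destroy() error {
	if err := i.cmd.Process.Kill(); err != nil {
		return fmt.Errorf("killing process: %w", err)
	}
	// Wait reports an error for a killed process; that is expected here,
	// so the error is deliberately discarded.
	_ = i.cmd.Wait()
	return nil
}
```

Once `destroy` returns, `cmd.ProcessState` is populated, which is a simple way to confirm that the child has actually exited rather than been left running in the background.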

Then, add a dependency on the golang/x/semaphore library, and
use it to implement a max limit.
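
A rough sketch of the limiting idea follows. The PR itself uses the semaphore library; to stay within the standard library, this example substitutes a buffered channel as a counting semaphore. `machineLimiter` and `runAll` are hypothetical names, and the job body is a placeholder for booting a machine.

```go
package main

import "sync"

// machineLimiter caps how many machines may exist at once.
// Hypothetical type: a buffered channel stands in for the semaphore library.
type machineLimiter struct {
	slots chan struct{}
}

func newMachineLimiter(max int) *machineLimiter {
	return &machineLimiter{slots: make(chan struct{}, max)}
}

// acquire blocks until a slot is free.
func (l *machineLimiter) acquire() { l.slots <- struct{}{} }

// release frees a slot; call it once the machine has been destroyed.
func (l *machineLimiter) release() { <-l.slots }

// runAll starts n jobs concurrently under the limiter and returns the
// highest number of jobs that held a slot at the same time.
func runAll(l *machineLimiter, n int) int {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		running int
		peak    int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.acquire()
			defer l.release()

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			// A real caller would boot and use a machine here.

			mu.Lock()
			running--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}
```

With a limiter of size 2, `runAll` never observes more than two jobs holding a slot, however many goroutines are started.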

Closes: https://github.com/coreos/mantle/issues/1157

cgwalters · 2020-01-14T15:12:07Z

Tweaked this to default --max-machines to the value of --parallel.
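
The defaulting could look roughly like the sketch below. `resolveMaxMachines` is a hypothetical helper, and treating a value of 0 or less as "unset" is an assumption, not something the PR states.

```go
package main

// resolveMaxMachines returns the effective machine cap. A value of 0 or
// less is treated as "unset" (an assumption) and falls back to the
// parallelism level.
func resolveMaxMachines(maxMachines, parallel int) int {
	if maxMachines <= 0 {
		return parallel
	}
	return maxMachines
}
```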

platform/machine/unprivqemu/cluster.go

arithx · 2020-01-14T16:16:11Z

A couple questions:

What happens if:

The --max-machines count is less than the number of machines a given test requires (e.g. the test needs 3 machines and --max-machines is 2)?
We're running tests serially with a --max-machines count of 1 and we hit a test like the NFS test that spawns machines inside the test?

jlebon · 2020-01-14T17:29:10Z

Nice, good catch!

> What happens if:

Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use. Seems good practice anyway for e.g. estimating resource needs when allocating the pod that will run kola.
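
A sketch of what that skip logic might look like, assuming each test declares its machine count up front. `testCase`, `machinesNeeded`, and `partition` are hypothetical names for illustration and are not taken from the kola code.

```go
package main

// testCase is a hypothetical test descriptor; machinesNeeded is the number
// of machines the test declares it will boot.
type testCase struct {
	name           string
	machinesNeeded int
}

// partition splits tests into those that fit under maxMachines and those
// that must be skipped because they need more machines than the cap allows.
func partition(tests []testCase, maxMachines int) (run, skipped []testCase) {
	for _, t := range tests {
		if t.machinesNeeded > maxMachines {
			skipped = append(skipped, t)
			continue
		}
		run = append(run, t)
	}
	return run, skipped
}
```

Skipping up front, rather than blocking on the limit at boot time, avoids a test waiting forever for machines it can never be granted.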

I think we could at least get in the leaks and fixes for now, right? That alone should greatly reduce memory stress on the nodes we're allocated on. WDYT about splitting those out as a separate PR?

cgwalters · 2020-01-14T17:43:25Z

> Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use.

Yeah, something like that.

> WDYT about splitting those out as a separate PR?

OK, done in #1163.

But, I think we'll need this PR in order to make the pipeline truly reliable in the face of hard memory caps.

ashcrow · 2020-05-12T21:11:05Z

> Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use.

> Yeah, something like that.

Added a few reviewers, though it sounds like the above is needed so we avoid timeouts before this is ready. Is it fair to mark this WIP for now?

ashcrow · 2020-05-12T21:13:49Z

And I think it might make sense to redirect this to the cosa repo.

cgwalters force-pushed the machine-max branch 2 times, most recently from ffb769c to f1a758f on January 14, 2020 15:08

cgwalters force-pushed the machine-max branch from f1a758f to e2e5a03 on January 15, 2020 00:37

ashcrow requested review from arithx and bgilbert May 12, 2020 20:46
