Skip to content
This repository has been archived by the owner on May 7, 2021. It is now read-only.

kola: Add --max-machines #1161

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

cgwalters
Copy link
Member

This is only implemented for qemu at the moment, though it'd
be a mostly mechanical change to propagate it to the other
providers.

For our pipeline testing, we need to have a hard cap on the number
of qemu instances we spawn, otherwise we can go over the RAM
allocated to the pod.

Actually the FCOS pipeline today doesn't impose a hard cap, and
my test pipeline in the coreosci (nested GCP virt) ended up bringing
down the node via the OOM killer.

There were a few bugs here; first we were leaking the spawned
qemu instance. We also need to invoke Wait() synchronously in
destruction.

Then, add a dependency on the golang/x/semaphore library, and
use it to implement a max limit.

Closes: https://github.com/coreos/mantle/issues/1157

@cgwalters cgwalters force-pushed the machine-max branch 2 times, most recently from ffb769c to f1a758f Compare January 14, 2020 15:08
@cgwalters
Copy link
Member Author

Tweaked this to default --max-machines to the value of --parallel.

@arithx
Copy link
Contributor

arithx commented Jan 14, 2020

A couple questions:

What happens if:

  • The max-machine count is less than the required amount of machines for a given test (e.x. the test needs 3 machines and max-machines is 2
  • We're running tests serially with a max-machines count of 1 and we hit a test like the NFS test that spawns machines inside of the test.

@jlebon
Copy link
Member

jlebon commented Jan 14, 2020

Nice, good catch!

What happens if:

Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use. Seems good practice anyway for e.g. estimating resource needs when allocating the pod that will run kola.

I think we could at least get in the leaks and fixes for now, right? That alone should greatly reduce memory stress on the nodes we're allocated on. WDYT about splitting those out as a separate PR?

@cgwalters
Copy link
Member Author

Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use.

Yeah, something like that.

WDYT about splitting those out as a separate PR?

OK done in

#1163

But, I think we'll need this PR in order to make the pipeline truly reliable in the face of hard memory caps.

This is only implemented for qemu at the moment, though it'd
be a mostly mechanical change to propagate it to the other
providers.

For our pipeline testing, we need to have a hard cap on the number
of qemu instances we spawn, otherwise we can go over the RAM
allocated to the pod.

Actually the FCOS pipeline today doesn't impose a hard cap, and
my test pipeline in the coreosci (nested GCP virt) ended up bringing
down the node via the OOM killer.

There were a few bugs here; first we were leaking the spawned
qemu instance.  We also need to invoke `Wait()` synchronously in
destruction.

Then, add a dependency on the `golang/x/semaphore` library, and
use it to implement a max limit.

Closes: https://github.com/coreos/mantle/issues/1157
@ashcrow
Copy link
Member

ashcrow commented May 12, 2020

Probably we should automatically skip tests that have a required machine count > --max-machines? And then have tests like the NFS one actually declare how many machines it intends to use.

Yeah, something like that.

Added a few reviewers, though it sounds like the above is needed so we avoid timeouts before this is ready. Is it fair to mark this WIP for now?

@ashcrow
Copy link
Member

ashcrow commented May 12, 2020

And I think this might might sense to be redirected to the cosa repo.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kola --parallel broken with qemu
4 participants