
neonvm: add readiness probe for sysfs scaling #1190

Draft · wants to merge 1 commit into base: main
Conversation

@mikhail-sakhnov (Contributor) commented Dec 30, 2024

Call the runner's /ready endpoint which, in sysfs scaling mode, proxies to the daemon's /cpu endpoint to check whether the runner pod and the VM are ready. The runner's /ready endpoint does nothing when using the QMP scaling model.
Move the neonvm-daemon line in the inittab so it starts right before vmstart.

This is a blocker for #1141: without a readiness probe there is a race in some e2e tests, where kuttl tries to execute the shell command kill while the VM is not fully booted and the acpid rules are not yet in effect.

It closes #1147, which was in the backlog, but since it is a blocker I extracted it from PR #1141.
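
For readers unfamiliar with the mechanism, here is a minimal sketch of what such a /ready handler could look like in Go. The daemon address, port, and function names below are assumptions for illustration, not the PR's actual code.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// Assumed address of neonvm-daemon inside the guest; the real value would come from the runner's config.
const daemonCPUURL = "http://guest-vm:25183/cpu"

// readyHandler sketches the runner's /ready endpoint: in sysfs scaling mode it
// forwards the check to neonvm-daemon's /cpu endpoint; in the QMP scaling model
// it reports ready unconditionally.
func readyHandler(useSysfsScaling bool) http.HandlerFunc {
	client := &http.Client{Timeout: 2 * time.Second}
	return func(w http.ResponseWriter, r *http.Request) {
		if !useSysfsScaling {
			// QMP scaling model: /ready is a no-op.
			w.WriteHeader(http.StatusOK)
			return
		}
		// sysfs scaling model: the VM counts as ready only if neonvm-daemon answers /cpu.
		resp, err := client.Get(daemonCPUURL)
		if err != nil {
			http.Error(w, fmt.Sprintf("daemon not ready: %v", err), http.StatusServiceUnavailable)
			return
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			http.Error(w, "daemon returned non-OK status", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	http.HandleFunc("/ready", readyHandler(true))
	_ = http.ListenAndServe(":8080", nil) // assumed runner HTTP port
}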


github-actions bot commented Dec 30, 2024

No changes to the coverage.

HTML Report


@mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch from e5c9385 to 249d38b on December 30, 2024, 14:01
Call the runner's /ready endpoint which, in sysfs scaling mode, proxies to the
daemon's /cpu to check whether the runner pod and the VM are ready. The runner's
/ready endpoint does nothing when using the QMP scaling model.
Move the neonvm-daemon line in the inittab to start it right before vmstart.

Signed-off-by: Mikhail Sakhnov <[email protected]>
@mikhail-sakhnov force-pushed the misha/add-ready-probe-for-sysfs-scaling branch from 249d38b to 1ea3177 on January 7, 2025, 15:06
@sharnoff (Member) left a comment


Mostly LGTM, left a couple questions

::respawn:/neonvm/bin/udhcpc -t 1 -T 1 -A 1 -f -i eth0 -O 121 -O 119 -s /neonvm/bin/udhcpc.script
::respawn:/neonvm/bin/udevd
::wait:/neonvm/bin/udev-init.sh
::respawn:/neonvm/bin/acpid -f -c /neonvm/acpi
::respawn:/neonvm/bin/vector -c /neonvm/config/vector.yaml --config-dir /etc/vector --color never
::respawn:/neonvm/bin/chronyd -n -f /neonvm/config/chrony.conf -l /var/log/chrony/chrony.log
::respawn:/neonvm/bin/sshd -E /var/log/ssh.log -f /neonvm/config/sshd_config
::respawn:/neonvm/bin/neonvmd --addr=0.0.0.0:25183
Member

question: Why this change?

Contributor Author

Oh, I thought I covered it in the PR description.
We rely on neonvmd as the source of information for the readiness check.
For example, if we started neonvmd first in the VM, it could be that neonvmd is ready and serving requests while acpid or udevd have not even started yet.

	default:
		panic(fmt.Errorf("unknown pod phase: %q", pod.Status.Phase))
	}
}

// isRunnerPodReady returns whether the runner pod is ready respecting the readiness probe of its containers.
func isRunnerPodReady(pod *corev1.Pod) runnerStatusKind {
	if pod.Status.ContainerStatuses == nil {
Member

question: Is there a difference between == nil vs len(...) == 0 ? I was a little surprised to see this check on a slice

Contributor Author

You're right, I don't think there is any practical difference here between == nil and len(...) == 0 for a slice.
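
For reference (this snippet is not part of the PR), a standalone example of the distinction: a nil slice and an empty non-nil slice both have length 0, but only the nil one compares equal to nil, so len(...) == 0 is the slightly more defensive check.

package main

import "fmt"

func main() {
	var nilSlice []int    // declared but never assigned: nil
	emptySlice := []int{} // non-nil, zero-length

	fmt.Println(nilSlice == nil, len(nilSlice) == 0)     // true true
	fmt.Println(emptySlice == nil, len(emptySlice) == 0) // false true
}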

@@ -1443,6 +1463,21 @@ func podSpec(
		},
	}

	if withReadinessProbe {
Contributor Author

This is for the live migration, but I don't like this approach. I pushed this version to discuss what the correct approach for live migrations should be. The problem is that qemu in the target pod doesn't boot until the migration is performed, but the migration isn't performed unless the pod is ready. There are two ways to tackle the issue:

  • either change the ready probe in target pods (which is what the line I'm commenting on does; see the sketch after this comment), but we can't change the readiness probe on a running pod, so that pod will be forever without a readiness probe. I find this suboptimal tbh: some pods will have the probe, some won't;
  • or change the logic in the live-migration controller and don't wait for pod readiness.

@sharnoff do you see any better options?
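
For discussion, a rough sketch of option 1, i.e. attaching the probe only when podSpec builds a non-migration-target pod. The helper name, probe path, and port below are assumptions for illustration, not the actual diff.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// addReadinessProbe conditionally attaches an HTTP readiness probe to the runner
// container; live-migration target pods would be created with withReadinessProbe=false.
func addReadinessProbe(container *corev1.Container, withReadinessProbe bool) {
	if !withReadinessProbe {
		return
	}
	container.ReadinessProbe = &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/ready",             // assumed runner endpoint
				Port: intstr.FromInt(8080), // assumed runner HTTP port
			},
		},
		InitialDelaySeconds: 5,
		PeriodSeconds:       5,
	}
}

func main() {
	c := corev1.Container{Name: "neonvm-runner"}
	addReadinessProbe(&c, true)
	fmt.Printf("readiness probe: %+v\n", c.ReadinessProbe)
}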

Successfully merging this pull request may close these issues.

Add readiness probe for the vm-runner process