coredump in c10s testing farm #1900

Closed
jelly opened this issue Nov 8, 2024 · 3 comments · Fixed by #1909

jelly commented Nov 8, 2024

See for example this PR and check the console log:

[  267.171160] coredump: 8261(browser.sh): Unsafe core_pattern used with fs.suid_dumpable=2: pipe handler or fully qualified core dump path required. Set kernel.core_pattern before fs.suid_dumpable.

https://artifacts.dev.testing-farm.io/2790ec3f-7a63-4dcf-b98d-b62519ce604c/work-storagerwkkep52/console-97fac4f4-dcc6-4fb3-a90b-6c420c4f0a6a.log

To debug this, remember that we can reserve a test VM with the testing-farm utility.
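On a reserved VM, the two sysctls named in the kernel message can be inspected directly. The following is only a minimal sketch: the helper name `core_pattern_is_safe` is made up for illustration, and the check simply mirrors the wording of the message (a pipe handler or a fully qualified path is required when fs.suid_dumpable=2).

```shell
#!/bin/sh
# Hypothetical helper (not part of cockpit or the kernel tooling):
# returns 0 when a core_pattern value is a pipe handler ("|...") or an
# absolute path ("/..."), which is what the kernel message asks for.
core_pattern_is_safe() {
    case "$1" in
        '|'* | /*) return 0 ;;
        *) return 1 ;;
    esac
}

# Inspect the live values, if procfs exposes them.
if [ -r /proc/sys/kernel/core_pattern ] && [ -r /proc/sys/fs/suid_dumpable ]; then
    pattern=$(cat /proc/sys/kernel/core_pattern)
    dumpable=$(cat /proc/sys/fs/suid_dumpable)
    echo "kernel.core_pattern=$pattern"
    echo "fs.suid_dumpable=$dumpable"
    if [ "$dumpable" = 2 ] && ! core_pattern_is_safe "$pattern"; then
        echo "unsafe combination: this would trigger the coredump warning"
    fi
fi
```

If the last line is printed, the VM is in the state the console log complains about; it does not by itself explain why browser.sh crashed.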

@jelly jelly added this to Pilot tasks Nov 8, 2024
@martinpitt martinpitt moved this to detriment in Pilot tasks Nov 10, 2024
jelly commented Nov 12, 2024

The most recent test runs have not shown a coredump, but we still see the "ssh connection closed" issue during a run of the storage tests.

To try to reproduce the issue, I reserved a VM and ran:

dnf install -y podman cockpit-system cockpit-ws cockpit-bridge cockpit-machines virt-install dbus-tools firewalld  libvirt-daemon-driver-storage-iscsi libvirt-daemon-driver-storage-logical git
git clone https://github.com/cockpit-project/cockpit-machines.git
cd cockpit-machines

mkdir -p /root/.ssh
curl https://raw.githubusercontent.com/cockpit-project/bots/main/machine/identity.pub  >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys

useradd -c Administrator -G wheel admin
echo admin:foobar | chpasswd
echo root:foobar | chpasswd
su -c 'echo foobar | sudo --stdin whoami' - admin

podman pull ghcr.io/cockpit-project/tasks:2024-10-07
podman run --rm --shm-size=1024m --security-opt=label=disable --network=host --volume=/data:/logs:rw,U --env=LOGS=/logs --volume="$(pwd)":/source:rw,U --env=SOURCE=/source --volume=/usr/lib/os-release:/run/host/usr/lib/os-release:ro -ti ghcr.io/cockpit-project/tasks:2024-10-07 bash

TEST_OS=centos-10 TEST_BROWSER=firefox ./test/check-machines-disks -vst TestMachinesDisks.testDisks --machine localhost:22 --browser localhost:9090

To run the tests I have to start virtnetworkd and virtstoraged manually, as for some reason the test does not start them.
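For anyone repeating the steps, starting the two daemons could look roughly like this. It is a sketch, assuming the modular libvirt systemd units carry the same names as the daemons; the function name and the DRY_RUN switch are made up for illustration.

```shell
# Hypothetical helper: start the modular libvirt daemons the test needs.
# Assumes the systemd units are named after the daemons.
# Set DRY_RUN=1 to only print the commands instead of running them.
start_libvirt_daemons() {
    for unit in virtnetworkd virtstoraged; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "systemctl start $unit"
        else
            systemctl start "$unit"
        fi
    done
}
```

Run `start_libvirt_daemons` as root on the reserved VM before starting the test.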

I attempted this three times, and each time the ssh connection was cleanly closed at a random point during the run. I did not record the uptime of the VM.

The fourth time, I started a tmux session that repeatedly ran curl --head against my own website, to see whether the machine dies or only sshd dies. When my ssh connection was cut, the curl HEAD requests also stopped, so the machine really goes offline. We also can't ping these machines from the outside (as we ssh into them via a jumphost).
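A probe of this kind can be sketched as below. It also prints the uptime, which was not recorded in the earlier attempts. The function name, the default interval and the URL in the usage note are placeholders, not values from this issue.

```shell
# Hypothetical helper: print a timestamp, the VM uptime and the outcome of a
# HEAD request at a fixed interval, so the last line printed shows roughly
# when the machine went away.
# Usage: heartbeat URL [INTERVAL_SECONDS] [COUNT]   (COUNT=0: run until killed)
heartbeat() {
    url=$1
    interval=${2:-5}
    count=${3:-0}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # --head sends a HEAD request; the response headers are discarded.
        if curl --head --silent --max-time 5 --output /dev/null "$url"; then
            status=ok
        else
            status=fail
        fi
        # First field of /proc/uptime is seconds since boot (empty if unavailable).
        up=$(cut -d' ' -f1 /proc/uptime 2>/dev/null)
        echo "$(date -u +%FT%TZ) uptime=${up}s curl=$status"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Inside tmux on the VM this would be run as, for example, `heartbeat https://example.com 5`. A small uptime on the first line after reconnecting would point at a reboot rather than a network problem.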

Open questions:

  • How long do the other test scenarios run?
  • Does AWS have some watchdog which is triggered and kills the VM?
  • Any way to obtain guest logs after the fact? Or to "keep" a VM?
  • How is CentOS 10 different from our normal image?

jelly commented Nov 13, 2024

I reserved a test machine yesterday and just let it hang around; it wasn't killed after more than an hour, so that theory does not hold up. Also note that the ssh disconnect during the test happens quite "fast": if the logs are to be believed, this failed within 240 seconds.

So I'm currently running out of ideas about what could cause the machine to go away while running tests.

martinpitt commented Nov 15, 2024

I found a kernel oops in #1909 (comment) -- there's at least a chance that this is the same issue. Would match the symptoms! And I can reproduce it with the centos-10 image rebuild.

@martinpitt martinpitt self-assigned this Nov 15, 2024
martinpitt added a commit to martinpitt/cockpit-machines that referenced this issue Nov 15, 2024
RHEL 10 got a nasty kernel oops [1] which unceremoniously reboots the VM
without leaving any journal trace, only the QEMU console shows it.

It completely breaks Testing Farm runs (it cannot recover from reboots)
and also breaks our own CI in nasty ways, as after the reboot the
nondestructive recovery fails in all kinds of ways.

Fixes cockpit-project#1900

[1] https://issues.redhat.com/browse/RHEL-67841
@jelly jelly closed this as completed in 423d16a Nov 16, 2024
@github-project-automation github-project-automation bot moved this from detriment to improvement in Pilot tasks Nov 16, 2024
@martinpitt martinpitt moved this from improvement to detriment in Pilot tasks Nov 16, 2024