coredump in c10s testing farm #1900

jelly · 2024-11-08T14:30:29Z

See for example this PR and check the console log:

[  267.171160] coredump: 8261(browser.sh): Unsafe core_pattern used with fs.suid_dumpable=2: pipe handler or fully qualified core dump path required. Set kernel.core_pattern before fs.suid_dumpable.

https://artifacts.dev.testing-farm.io/2790ec3f-7a63-4dcf-b98d-b62519ce604c/work-storagerwkkep52/console-97fac4f4-dcc6-4fb3-a90b-6c420c4f0a6a.log

To debug this, remember that we can reserve a test VM with the testing-farm utility.
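For reference, the kernel message above says that with fs.suid_dumpable=2 the kernel only accepts a kernel.core_pattern that is either a pipe handler (starting with `|`) or a fully qualified path. A minimal sketch for checking this on a reserved VM could look like the following. The function names are made up for illustration, and the sketch only reads the two sysctl files under /proc/sys; it does not change anything.

```shell
#!/bin/sh
# Return 0 if the given core_pattern is acceptable with fs.suid_dumpable=2,
# i.e. it is a pipe handler ("|...") or an absolute path ("/...").
core_pattern_is_safe() {
    case "$1" in
        '|'*|/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Compare the two sysctl values. The file arguments default to the live
# /proc/sys entries and can be overridden for testing.
# Returns 0 if fine, 1 if the combination is unsafe, 2 if unreadable.
check_coredump_sysctls() {
    pattern_file=${1:-/proc/sys/kernel/core_pattern}
    dumpable_file=${2:-/proc/sys/fs/suid_dumpable}
    [ -r "$pattern_file" ] && [ -r "$dumpable_file" ] || return 2
    pattern=$(cat "$pattern_file")
    dumpable=$(cat "$dumpable_file")
    if [ "$dumpable" = 2 ] && ! core_pattern_is_safe "$pattern"; then
        echo "unsafe: core_pattern='$pattern' with suid_dumpable=2"
        return 1
    fi
    echo "ok: core_pattern='$pattern' suid_dumpable=$dumpable"
}
```

Calling `check_coredump_sysctls` without arguments inspects the running system. Which component sets these values on the c10s image, and in which order, is not established in this issue.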


jelly · 2024-11-12T13:59:02Z

Most recent test runs have not shown a coredump, but we still see the "ssh connection closed" issue during runs of the storage tests.

To try to reproduce the issue, I reserved a VM and then ran:

dnf install -y podman cockpit-system cockpit-ws cockpit-bridge cockpit-machines virt-install dbus-tools firewalld  libvirt-daemon-driver-storage-iscsi libvirt-daemon-driver-storage-logical git
git clone https://github.com/cockpit-project/cockpit-machines.git
cd cockpit-machines

mkdir -p /root/.ssh
curl https://raw.githubusercontent.com/cockpit-project/bots/main/machine/identity.pub  >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys

useradd -c Administrator -G wheel admin
echo admin:foobar | chpasswd
echo root:foobar | chpasswd
su -c 'echo foobar | sudo --stdin whoami' - admin

podman pull ghcr.io/cockpit-project/tasks:2024-10-07
podman run --rm --shm-size=1024m --security-opt=label=disable --network=host --volume=/data:/logs:rw,U --env=LOGS=/logs --volume="$(pwd)":/source:rw,U --env=SOURCE=/source --volume=/usr/lib/os-release:/run/host/usr/lib/os-release:ro -ti ghcr.io/cockpit-project/tasks:2024-10-07 bash

TEST_OS=centos-10 TEST_BROWSER=firefox ./test/check-machines-disks -vst TestMachinesDisks.testDisks --machine localhost:22 --browser localhost:9090

To run the tests I have to start virtnetworkd and virtstoraged manually, as for some reason the test does not start them.
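A minimal sketch of that manual step, assuming the two daemons are available as systemd units with the names used above. The `SYSTEMCTL` override is a hypothetical convenience for dry runs and is not part of any existing tooling.

```shell
#!/bin/sh
# Start the libvirt daemons that the test run needs but does not start itself.
# Set SYSTEMCTL=echo to print the commands instead of running them.
start_libvirt_daemons() {
    for unit in virtnetworkd virtstoraged; do
        "${SYSTEMCTL:-systemctl}" start "$unit" || return 1
    done
}
```

Run `start_libvirt_daemons` as root on the VM before starting the test. Whether socket activation should normally take care of this on the image is an open point and not answered here.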

I attempted this three times, and each time it disconnected cleanly at a random point during the run. I did not record the uptime of the VM.

The fourth time I started a tmux session running curl --head against my own website, to see whether the whole machine dies or only sshd. When my ssh connection was cut, the curl HEAD requests also stopped, so the machine really goes offline. We also can't ping these machines from the outside (as we ssh into them via a jump host).
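The probe described above can be sketched as a small loop that runs a command at a fixed interval and prints a timestamped result. This is an illustration, not the exact command that was used; the `heartbeat` function name and its arguments are assumptions.

```shell
#!/bin/sh
# Run a probe command repeatedly and print one timestamped line per attempt.
# Usage: heartbeat COUNT INTERVAL_SECONDS COMMAND [ARGS...]
heartbeat() {
    count=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        if "$@" >/dev/null 2>&1; then status=ok; else status=FAIL; fi
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status"
        i=$((i + 1))
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
    return 0
}
```

Inside tmux on the VM this could be started as `heartbeat 100000 5 curl --head --silent --max-time 5 https://example.org/`, where the URL is a placeholder for a server whose access log you can read. Output written on the VM itself is lost when the VM goes away, so the server-side log is what shows when the requests stopped.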

Open questions:

How long do the other test scenarios run?
Does AWS have some watchdog which is triggered and kills the VM?
Any way to obtain guest logs after the fact? Or to "keep" a VM?
How is CentOS 10 different from our normal image?

jelly · 2024-11-13T09:29:33Z

I reserved a test machine yesterday and just let it sit idle; it wasn't killed after more than an hour, so that theory does not hold up. Also note that the ssh disconnect during the test happens quite "fast": if the logs are to be believed, this run failed within 240 seconds.

So I'm currently running out of ideas about what could cause the machine to go away while running tests.

martinpitt · 2024-11-15T15:26:09Z

I found a kernel oops in #1909 (comment) -- there's at least a chance that this is the same issue. Would match the symptoms! And I can reproduce it with the centos-10 image rebuild.

RHEL 10 got a nasty kernel oops [1] which unceremoniously reboots the VM without leaving any journal trace; only the QEMU console shows it. It completely breaks Testing Farm runs (it cannot recover from reboots) and also breaks our own CI in nasty ways, as after the reboot the nondestructive recovery fails in all kinds of ways.

Fixes cockpit-project#1900

[1] https://issues.redhat.com/browse/RHEL-67841
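Since the oops leaves no journal trace, a saved copy of the console log is the place to look. A minimal sketch for scanning such a file follows; the marker strings are a guess at common kernel oops and panic lines rather than an exhaustive list, and `find_kernel_oops` is a hypothetical helper name.

```shell
#!/bin/sh
# Print, with line numbers, the lines of a saved console log that look like
# a kernel oops or panic. Exits non-zero if nothing matches.
find_kernel_oops() {
    grep -n -E 'Oops:|BUG:|Kernel panic|general protection fault' "$1"
}
```

For a Testing Farm run, download the console log artifact first and pass its local path, for example `find_kernel_oops console.log`.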

jelly added this to Pilot tasks Nov 8, 2024

martinpitt moved this to detriment in Pilot tasks Nov 10, 2024

martinpitt mentioned this issue Nov 15, 2024

images: Move rhel-10-0 from beta to final cockpit-project/bots#7095

Merged

3 tasks

martinpitt assigned martinpitt and unassigned martinpitt Nov 15, 2024

martinpitt mentioned this issue Nov 15, 2024

testAddDiskNFS troubleshooting #1909

Merged

martinpitt self-assigned this Nov 15, 2024

jelly closed this as completed in 423d16a Nov 16, 2024

jelly closed this as completed in #1909 Nov 16, 2024

github-project-automation bot moved this from detriment to improvement in Pilot tasks Nov 16, 2024

martinpitt moved this from improvement to detriment in Pilot tasks Nov 16, 2024
