coredump in c10s testing farm #1900
Most recent test runs have not shown a coredump, but we still see the "ssh connection closed" issue during a test run of the storage tests:
To try to reproduce the issue, I reserved a VM and then ran:
To run tests, for some reason I have to start
Attempted this three times and it cleanly disconnected at random points during the run; I did not record the uptime of the VM. The fourth time I started a tmux to
Open questions:
I reserved a test machine yesterday and just let it hang around; it wasn't killed after > 1 hour, so that theory does not hold up. Also note that the test ssh disconnect happens quite "fast": if the logs are to be believed, this failed in 240 seconds. So I'm currently running out of ideas on what could cause the machine to go away while running tests.
I found a kernel oops in #1909 (comment) -- there's at least a chance that this is the same issue. Would match the symptoms! And I can reproduce it with the centos-10 image rebuild.
RHEL 10 got a nasty kernel oops [1] which unceremoniously reboots the VM without leaving any journal trace, only the QEMU console shows it. It completely breaks Testing Farm runs (it cannot recover from reboots) and also breaks our own CI in nasty ways, as after the reboot the nondestructive recovery fails in all kinds of ways.

Fixes cockpit-project#1900

[1] https://issues.redhat.com/browse/RHEL-67841
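
Since the reboot leaves nothing in the journal, one cheap way to confirm that a dropped ssh connection was really a reboot might be to compare the kernel boot ID before and after the test run. The following is only a minimal sketch of that idea, not something the test suite does: the helper names `read_boot_id` and `rebooted` are hypothetical, and it assumes a Linux guest that exposes `/proc/sys/kernel/random/boot_id`.

```python
from pathlib import Path

# Linux exposes a random ID here that is regenerated on every boot.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path: Path = BOOT_ID_PATH) -> str:
    """Return the boot ID stored at `path`, without surrounding whitespace."""
    return path.read_text().strip()


def rebooted(before: str, after: str) -> bool:
    """Return True if the two recorded boot IDs differ, i.e. the VM rebooted."""
    return before.strip() != after.strip()
```

Recording `read_boot_id()` on the VM before starting the tests and again after reconnecting would then tell a silent reboot apart from a plain network drop.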
See for example this PR and check the console log:
https://artifacts.dev.testing-farm.io/2790ec3f-7a63-4dcf-b98d-b62519ce604c/work-storagerwkkep52/console-97fac4f4-dcc6-4fb3-a90b-6c420c4f0a6a.log
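
Because only the console log shows the oops, a small script can help when checking several such logs. This is a rough sketch under assumptions: the marker strings are common kernel crash messages chosen for illustration (not taken from this particular log), and `find_oops_lines` is a hypothetical helper that expects the already-downloaded log text.

```python
import re

# Assumed crash markers; not an exhaustive list of what a kernel may print.
OOPS_PATTERN = re.compile(
    r"Oops:|BUG:|Kernel panic - not syncing|general protection fault|Call Trace:"
)


def find_oops_lines(console_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs for lines that look like a kernel crash."""
    hits = []
    for number, line in enumerate(console_text.splitlines(), start=1):
        if OOPS_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits
```

An empty result does not prove the run was clean, since the marker list is only a guess; it just narrows down where to look first.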
To debug this, remember that we can reserve a test VM with the testing-farm utility.