
CI: tests regression on Ubuntu latest #2495

Closed
lucab opened this issue Dec 7, 2021 · 13 comments · Fixed by #2517
Comments

@lucab
Member

lucab commented Dec 7, 2021

It looks like the GitHub CI workflow on "Ubuntu Latest Release" is permanently failing now (full logs from a random PR attached):

2021-12-07T13:50:00.2489680Z ERROR: tests/test-sysroot-c - missing test plan
2021-12-07T13:50:00.2490763Z ERROR: tests/test-sysroot-c - exited with status 124
[...]
2021-12-07T13:50:03.7839111Z ERROR: tests/test-pull-c - missing test plan
2021-12-07T13:50:03.7840219Z ERROR: tests/test-pull-c - exited with status 124

Looking at the logs, I'm not exactly sure what's going on. But possibly we are seeing a coredump or hang from the GPG agent when trying to stop it?

2021-12-07T13:50:06.5383935Z ++ gpg-connect-agent --homedir /var/tmp/tap-test.Abay33/gpghome killagent /bye
2021-12-07T13:50:06.5385446Z gpg-connect-agent: no running gpg-agent - starting '/usr/bin/gpg-agent'
2021-12-07T13:50:06.5386725Z gpg-connect-agent: waiting for the agent to come up ... (5s)
2021-12-07T13:50:06.5387641Z gpg-connect-agent: connection to agent established
2021-12-07T13:50:06.5388341Z timeout: the monitored command dumped core

/cc @dbnicholson @smcv

@dbnicholson
Member

Hmm. I can't tell you how many times I've had to fight with cleaning up gpg-agent... However, I don't think the problem here is gpg-agent and you can see it running the same process successfully at other parts of the log. It's just part of gpg-agent's design that the client (gpg-connect-agent) has to start gpg-agent even though the command you're asking it to run is killagent.

Furthermore, timeout isn't actually monitoring gpg-connect-agent. It's monitoring the test programs test-sysroot-c and test-pull-c. I think it's actually those programs that are dumping core. I can't tell why, though. Unless the call to ot_test_run_libtest is what's segfaulting or whatever, then I think the problem is elsewhere.

@lucab
Member Author

lucab commented Dec 10, 2021

Indeed, the abrupt end of the log messages tricked me.

I'm able to reproduce this locally in an ubuntu:rolling container, but I couldn't get to the bottom of it as I ended up somewhere in odd glib land.
The unit test hangs indefinitely, and timeout kills it after a bit.
Before it gets terminated, I can see it is stuck this way:

(gdb) bt
#0  0x00007f02235b6912 in __GI___libc_read (fd=fd@entry=4, buf=buf@entry=0x7ffcfddf32c0, nbytes=nbytes@entry=8) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007f022376c28a in read (__nbytes=8, __buf=0x7ffcfddf32c0, __fd=4, __fd=<optimized out>, __buf=<optimized out>, __nbytes=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:47
#2  read_ints (fd=4, buf=buf@entry=0x7ffcfddf32c0, n_ints_in_buf=n_ints_in_buf@entry=2, n_ints_read=n_ints_read@entry=0x7ffcfddf31a8, error=error@entry=0x7ffcfddf3630) at ../../../glib/gspawn.c:1753
#3  0x00007f022376ccd3 in fork_exec
    (intermediate_child=intermediate_child@entry=0, working_directory=working_directory@entry=0x0, argv=<optimized out>, envp=envp@entry=0x0, close_descriptors=close_descriptors@entry=1, search_path=search_path@entry=1, search_path_from_envp=<optimized out>, stdout_to_null=<optimized out>, stderr_to_null=<optimized out>, child_inherits_stdin=<optimized out>, file_and_argv_zero=<optimized out>, cloexec_pipes=<optimized out>, child_setup=<optimized out>, user_data=<optimized out>, child_pid=<optimized out>, stdin_pipe_out=<optimized out>, stdout_pipe_out=<optimized out>, stderr_pipe_out=<optimized out>, stdin_fd=<optimized out>, stdout_fd=<optimized out>, stderr_fd=<optimized out>, source_fds=<optimized out>, target_fds=<optimized out>, n_fds=<optimized out>, error=<optimized out>) at ../../../glib/gspawn.c:2331
#4  0x00007f022376dca9 in g_spawn_sync
    (working_directory=working_directory@entry=0x0, argv=<optimized out>, envp=envp@entry=0x0, flags=flags@entry=G_SPAWN_SEARCH_PATH, child_setup=child_setup@entry=0x0, user_data=user_data@entry=0x0, standard_output=0x0, standard_error=0x0, exit_status=0x7ffcfddf35b4, error=0x7ffcfddf3630) at ../../../glib/gspawn.c:400
#5  0x00005636218b4229 in ot_test_run_libtest (cmd=cmd@entry=0x5636218b9218 "setup_os_repository \"archive\" \"syslinux\"", error=error@entry=0x7ffcfddf3630) at tests/libostreetest.c:49
#6  0x00005636218b45dc in ot_test_setup_sysroot (cancellable=cancellable@entry=0x0, error=error@entry=0x7ffcfddf3630) at tests/libostreetest.c:141
#7  0x00005636218b3e37 in main (argc=<optimized out>, argv=<optimized out>) at tests/test-sysroot-c.c:85

Which comes from here:

  int estatus;
  if (!g_spawn_sync (NULL, (char**)argv->pdata, NULL, G_SPAWN_SEARCH_PATH,
                     NULL, NULL, NULL, NULL, &estatus, error))
    return FALSE;
  if (!g_spawn_check_exit_status (estatus, error))
    return FALSE;

And this happens while the underlying bash child has exited already:

|           \_ /ostree/tests/.libs/test-sysroot-c -k --tap
|               \_ [bash] <defunct>

So apparently g_spawn_sync is blocked on a read from FD 4 (which appears to be a pipe), while the spawned child process has already exited and needs to be reaped (which never happens, as the parent never returns from the blocked read).
According to the package list, this is with libglib2.0-0:amd64 = 2.68.4-1ubuntu1.

@dbnicholson
Member

@pwithnall any ideas what's happening here? I haven't tried to debug any further, but any pointers in the right direction would be helpful.

@dbnicholson
Member

Actually, this looks a lot like https://gitlab.gnome.org/GNOME/glib/-/issues/2506. Those changes were backported to glib 2.70, but Ubuntu rolling/21.10/impish has 2.68.4. Maybe someone at Ubuntu just needs a poke to backport those changes.

@pwithnall
Member

> Actually, this looks a lot like https://gitlab.gnome.org/GNOME/glib/-/issues/2506. Those changes were backported to glib 2.70, but Ubuntu rolling/21.10/impish has 2.68.4. Maybe someone at Ubuntu just needs a poke to backport those changes.

Yeah, from Luca’s backtrace above that diagnosis fits. We’re not going to backport the fix to 2.68.x in upstream GLib since there will be no more 2.68 releases, but Ubuntu (and other distros) certainly should if they’re still shipping that version.

@dbnicholson
Member

Reported to Ubuntu in https://bugs.launchpad.net/ubuntu/+source/glib2.0/+bug/1956059.

@dbnicholson
Member

Well that's interesting. I went through the trouble of actually building glib with the gspawn patches backported and then installed it in an impish container to run the test suite. To my surprise it still failed in the same way with a defunct bash child that seems to have completed normally.

While fiddling around trying to get a core dump, I ran it with --privileged and the test (test-pull-c) passed. Also, when trying to boil down a reproducer, this seemed to be enough:

#include "config.h"
#include <glib.h>
#include "libostreetest.h"

static void
test_setup (gconstpointer data)
{
  GError *error = NULL;

  ot_test_run_libtest ("setup_fake_remote_repo1 archive", &error);
  g_assert_no_error (error);
}

int main (int argc, char *argv[])
{
  g_test_init (&argc, &argv, NULL);
  g_test_add_data_func ("/repro/setup", NULL, test_setup);
  return g_test_run ();
}

There's something about setup_fake_remote_repo1 and its siblings that causes the hang since other uses of ot_test_setup_repo work fine.

@dbnicholson
Member

It passes with --security-opt seccomp=unconfined. So, presumably there's some (newer?) syscall that's not included in the seccomp profile and then some other code isn't checking the return value.

Ah - changing the seccomp profile to SCMP_ACT_LOG showed it was syscall 436, which is close_range. But close_range is in the allow list in the seccomp profile. WTF...

@dbnicholson
Member

I think the problem is ultimately https://bugs.launchpad.net/ubuntu/+source/libseccomp/+bug/1944436. The ubuntu-latest VM is 20.04. Docker on the Azure Ubuntu VMs uses the system libseccomp, and the libseccomp on focal doesn't actually know that close_range is syscall 436.

So, libseccomp causes close_range to error with EPERM, but gspawn only falls back to setting CLOEXEC manually when the error is ENOSYS or EINVAL. @pwithnall it seems like gspawn should either fall back unconditionally there or return an error from safe_fdwalk_set_cloexec.
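A hedged sketch of the unconditional fallback being suggested here (my own illustration, not GLib's actual gspawn code; set_cloexec_from and the 1024 bound are made up for the example): try close_range() with CLOSE_RANGE_CLOEXEC first, and walk the descriptors by hand whenever it fails for any reason, EPERM included:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#if defined(__linux__)
# include <linux/close_range.h>  /* CLOSE_RANGE_CLOEXEC, kernel >= 5.11 */
#endif

/* Sketch: set FD_CLOEXEC on every fd >= lowfd. Unlike the current
 * gspawn code, which only falls back on ENOSYS/EINVAL, this falls
 * back on ANY close_range() failure, so a seccomp filter that maps
 * the unknown syscall to EPERM still gets correct behaviour. */
static void
set_cloexec_from (int lowfd)
{
#if defined(SYS_close_range) && defined(CLOSE_RANGE_CLOEXEC)
  if (syscall (SYS_close_range, (unsigned int) lowfd, ~0U,
               CLOSE_RANGE_CLOEXEC) == 0)
    return;  /* fast path worked */
#endif
  /* Fallback: fcntl() each fd individually. The 1024 bound is
   * illustrative; real code would consult getrlimit() or walk
   * /proc/self/fd. */
  for (int fd = lowfd; fd < 1024; fd++)
    {
      int flags = fcntl (fd, F_GETFD);
      if (flags != -1)
        fcntl (fd, F_SETFD, flags | FD_CLOEXEC);
    }
}
```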

dbnicholson added a commit to dbnicholson/ostree that referenced this issue Jan 13, 2022
The ubuntu-latest VMs are currently based on 20.04 (focal). In focal,
libseccomp2 doesn't know about the close_range syscall[1], but
g_spawn_sync in impish tries to use close_range since it's defined in
glibc. That causes libseccomp2 to return EPERM as it does for any
unknown syscalls. g_spawn_sync carries on silently instead of falling
back to other means of setting CLOEXEC on open FDs. Eventually it causes
some tests to hang since once side of a pipe is never closed. Remove
this when libseccomp2 in focal is updated or glib in impish handles the
EPERM better.

1. https://bugs.launchpad.net/ubuntu/+source/libseccomp/+bug/1944436

Fixes: ostreedev#2495
@dbnicholson
Member

#2517 provides a workaround by disabling seccomp for ubuntu:rolling. I don't think there's any way to adjust the seccomp profile (such as making it return ENOSYS) since libseccomp doesn't know what syscall 436 is and you can't use numeric values.

@smcv
Contributor

smcv commented Jan 13, 2022

> So, libseccomp causes close_range to error with EPERM

My understanding is that this is completely under the caller's control, so it's the caller (Docker) that has decided to use EPERM. This was a long-standing design issue in Docker: the Linux kernel ABI is that syscalls unknown to the system should fail with ENOSYS, allowing callers to fall back to an older interface that the system does understand.

@cgwalters
Member

I think this is fixed in recent docker. It was definitely fixed in recent podman.

@dbnicholson
Member

@smcv Good point. FWIW, the default docker seccomp profile is here and documented here. The default profile still returns EPERM as defined by "defaultAction": "SCMP_ACT_ERRNO" and "defaultErrnoRet": 1.

I tried changing that to "defaultErrnoRet": 38 to get ENOSYS, but it didn't seem to help. Unfortunately, I think I've sunk too much time into this already to go further.
