
CI: tests regression on Ubuntu latest #2495

Closed
lucab opened this issue Dec 7, 2021 · 13 comments · Fixed by #2517
Comments

@lucab
Member

lucab commented Dec 7, 2021

It looks like the GitHub CI workflow on "Ubuntu Latest Release" is permanently failing now (full logs from a random PR attached):

2021-12-07T13:50:00.2489680Z ERROR: tests/test-sysroot-c - missing test plan
2021-12-07T13:50:00.2490763Z ERROR: tests/test-sysroot-c - exited with status 124
[...]
2021-12-07T13:50:03.7839111Z ERROR: tests/test-pull-c - missing test plan
2021-12-07T13:50:03.7840219Z ERROR: tests/test-pull-c - exited with status 124

Looking at the logs, I'm not exactly sure what's going on. But possibly we are seeing a coredump or hang from the GPG agent when trying to stop it?

2021-12-07T13:50:06.5383935Z ++ gpg-connect-agent --homedir /var/tmp/tap-test.Abay33/gpghome killagent /bye
2021-12-07T13:50:06.5385446Z gpg-connect-agent: no running gpg-agent - starting '/usr/bin/gpg-agent'
2021-12-07T13:50:06.5386725Z gpg-connect-agent: waiting for the agent to come up ... (5s)
2021-12-07T13:50:06.5387641Z gpg-connect-agent: connection to agent established
2021-12-07T13:50:06.5388341Z timeout: the monitored command dumped core

/cc @dbnicholson @smcv

@dbnicholson
Member

Hmm. I can't tell you how many times I've had to fight with cleaning up gpg-agent... However, I don't think the problem here is gpg-agent and you can see it running the same process successfully at other parts of the log. It's just part of gpg-agent's design that the client (gpg-connect-agent) has to start gpg-agent even though the command you're asking it to run is killagent.

Furthermore, timeout isn't actually monitoring gpg-connect-agent. It's monitoring the test programs test-sysroot-c and test-pull-c. I think it's actually those programs that are dumping core. I can't tell why, though. Unless the call to ot_test_run_libtest is what's segfaulting or whatever, then I think the problem is elsewhere.

@lucab
Member Author

lucab commented Dec 10, 2021

Indeed, the abrupt end of the log messages tricked me.

I'm able to reproduce this locally in an ubuntu:rolling container, but I couldn't get to the bottom of it as I ended up somewhere in odd glib land.
The unit test hangs indefinitely, and timeout kills it after a bit.
Before it gets terminated, I can see it is stuck this way:

(gdb) bt
#0  0x00007f02235b6912 in __GI___libc_read (fd=fd@entry=4, buf=buf@entry=0x7ffcfddf32c0, nbytes=nbytes@entry=8) at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007f022376c28a in read (__nbytes=8, __buf=0x7ffcfddf32c0, __fd=4, __fd=<optimized out>, __buf=<optimized out>, __nbytes=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:47
#2  read_ints (fd=4, buf=buf@entry=0x7ffcfddf32c0, n_ints_in_buf=n_ints_in_buf@entry=2, n_ints_read=n_ints_read@entry=0x7ffcfddf31a8, error=error@entry=0x7ffcfddf3630) at ../../../glib/gspawn.c:1753
#3  0x00007f022376ccd3 in fork_exec
    (intermediate_child=intermediate_child@entry=0, working_directory=working_directory@entry=0x0, argv=<optimized out>, envp=envp@entry=0x0, close_descriptors=close_descriptors@entry=1, search_path=search_path@entry=1, search_path_from_envp=<optimized out>, stdout_to_null=<optimized out>, stderr_to_null=<optimized out>, child_inherits_stdin=<optimized out>, file_and_argv_zero=<optimized out>, cloexec_pipes=<optimized out>, child_setup=<optimized out>, user_data=<optimized out>, child_pid=<optimized out>, stdin_pipe_out=<optimized out>, stdout_pipe_out=<optimized out>, stderr_pipe_out=<optimized out>, stdin_fd=<optimized out>, stdout_fd=<optimized out>, stderr_fd=<optimized out>, source_fds=<optimized out>, target_fds=<optimized out>, n_fds=<optimized out>, error=<optimized out>) at ../../../glib/gspawn.c:2331
#4  0x00007f022376dca9 in g_spawn_sync
    (working_directory=working_directory@entry=0x0, argv=<optimized out>, envp=envp@entry=0x0, flags=flags@entry=G_SPAWN_SEARCH_PATH, child_setup=child_setup@entry=0x0, user_data=user_data@entry=0x0, standard_output=0x0, standard_error=0x0, exit_status=0x7ffcfddf35b4, error=0x7ffcfddf3630) at ../../../glib/gspawn.c:400
#5  0x00005636218b4229 in ot_test_run_libtest (cmd=cmd@entry=0x5636218b9218 "setup_os_repository \"archive\" \"syslinux\"", error=error@entry=0x7ffcfddf3630) at tests/libostreetest.c:49
#6  0x00005636218b45dc in ot_test_setup_sysroot (cancellable=cancellable@entry=0x0, error=error@entry=0x7ffcfddf3630) at tests/libostreetest.c:141
#7  0x00005636218b3e37 in main (argc=<optimized out>, argv=<optimized out>) at tests/test-sysroot-c.c:85

Which comes from here:

  int estatus;
  if (!g_spawn_sync (NULL, (char**)argv->pdata, NULL, G_SPAWN_SEARCH_PATH,
                     NULL, NULL, NULL, NULL, &estatus, error))
    return FALSE;
  if (!g_spawn_check_exit_status (estatus, error))
    return FALSE;

And this happens while the underlying bash child has exited already:

|           \_ /ostree/tests/.libs/test-sysroot-c -k --tap
|               \_ [bash] <defunct>

So apparently g_spawn_sync is blocked on a read from FD 4 (which appears to be a pipe), while the spawned child process has already exited and needs to be reaped (which never happens, as the parent never returns from the blocked read).
According to the package list, this is with libglib2.0-0:amd64 = 2.68.4-1ubuntu1.

@dbnicholson
Member

@pwithnall any ideas what's happening here? I haven't tried to debug any further, but any pointers in the right direction would be helpful.

@dbnicholson
Member

Actually, this looks a lot like https://gitlab.gnome.org/GNOME/glib/-/issues/2506. Those changes were backported to glib 2.70, but Ubuntu rolling/21.10/impish has 2.68.4. Maybe someone at Ubuntu just needs a poke to backport those changes.

@pwithnall
Member

> Actually, this looks a lot like https://gitlab.gnome.org/GNOME/glib/-/issues/2506. Those changes were backported to glib 2.70, but Ubuntu rolling/21.10/impish has 2.68.4. Maybe someone at Ubuntu just needs a poke to backport those changes.

Yeah, from Luca’s backtrace above that diagnosis fits. We’re not going to backport the fix to 2.68.x in upstream GLib since there will be no more 2.68 releases, but Ubuntu (and other distros) certainly should if they’re still shipping that version.

@dbnicholson
Member

Reported to Ubuntu in https://bugs.launchpad.net/ubuntu/+source/glib2.0/+bug/1956059.

@dbnicholson
Member

Well that's interesting. I went through the trouble of actually building glib with the gspawn patches backported and then installed it in an impish container to run the test suite. To my surprise it still failed in the same way with a defunct bash child that seems to have completed normally.

While fiddling around trying to get a core dump, I ran it with --privileged and the test (test-pull-c) passed. Also, when trying to boil down a reproducer, this seemed to be enough:

#include "config.h"
#include <glib.h>
#include "libostreetest.h"

static void
test_setup (gconstpointer data)
{
  GError *error = NULL;

  ot_test_run_libtest ("setup_fake_remote_repo1 archive", &error);
  g_assert_no_error (error);
}

int main (int argc, char *argv[])
{
  g_test_init (&argc, &argv, NULL);
  g_test_add_data_func ("/repro/setup", NULL, test_setup);
  return g_test_run ();
}

There's something about setup_fake_remote_repo1 and its siblings that causes the hang since other uses of ot_test_setup_repo work fine.

@dbnicholson
Member

It passes with --security-opt seccomp=unconfined. So, presumably there's some (newer?) syscall that's not included in the seccomp profile and then some other code isn't checking the return value.

Ah - changing the seccomp profile to SCMP_ACT_LOG showed it was syscall 436, which is close_range. But close_range is in the allow list in the seccomp profile. WTF...

@dbnicholson
Member

I think the problem is ultimately https://bugs.launchpad.net/ubuntu/+source/libseccomp/+bug/1944436. The ubuntu-latest VM is 20.04. Docker on the Azure Ubuntu VMs uses the system libseccomp, and the libseccomp on focal doesn't actually know that close_range is syscall 436.

So, libseccomp causes close_range to error with EPERM, but gspawn only falls back to setting CLOEXEC manually when the error is ENOSYS or EINVAL. @pwithnall it seems like gspawn should either fall back unconditionally there or return an error from safe_fdwalk_set_cloexec.
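A hedged sketch of the unconditional fallback being suggested here (my own illustration, not GLib's actual gspawn code; set_cloexec_from and the 1024 bound are made up for the example): try close_range() with CLOSE_RANGE_CLOEXEC first, and walk the descriptors by hand whenever it fails for any reason, EPERM included:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#if defined(__linux__)
# include <linux/close_range.h>  /* CLOSE_RANGE_CLOEXEC, kernel >= 5.11 */
#endif

/* Sketch: set FD_CLOEXEC on every fd >= lowfd. Unlike the current
 * gspawn code, which only falls back on ENOSYS/EINVAL, this falls
 * back on ANY close_range() failure, so a seccomp filter that maps
 * the unknown syscall to EPERM still gets correct behaviour. */
static void
set_cloexec_from (int lowfd)
{
#if defined(SYS_close_range) && defined(CLOSE_RANGE_CLOEXEC)
  if (syscall (SYS_close_range, (unsigned int) lowfd, ~0U,
               CLOSE_RANGE_CLOEXEC) == 0)
    return;  /* fast path worked */
#endif
  /* Fallback: fcntl() each fd individually. The 1024 bound is
   * illustrative; real code would consult getrlimit() or walk
   * /proc/self/fd. */
  for (int fd = lowfd; fd < 1024; fd++)
    {
      int flags = fcntl (fd, F_GETFD);
      if (flags != -1)
        fcntl (fd, F_SETFD, flags | FD_CLOEXEC);
    }
}
```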

dbnicholson added a commit to dbnicholson/ostree that referenced this issue Jan 13, 2022
The ubuntu-latest VMs are currently based on 20.04 (focal). In focal,
libseccomp2 doesn't know about the close_range syscall[1], but
g_spawn_sync in impish tries to use close_range since it's defined in
glibc. That causes libseccomp2 to return EPERM as it does for any
unknown syscalls. g_spawn_sync carries on silently instead of falling
back to other means of setting CLOEXEC on open FDs. Eventually it causes
some tests to hang since once side of a pipe is never closed. Remove
this when libseccomp2 in focal is updated or glib in impish handles the
EPERM better.

1. https://bugs.launchpad.net/ubuntu/+source/libseccomp/+bug/1944436

Fixes: ostreedev#2495
@dbnicholson
Member

#2517 provides a workaround by disabling seccomp for ubuntu:rolling. I don't think there's any way to adjust the seccomp profile (such as making it return ENOSYS) since libseccomp doesn't know what syscall 436 is and you can't use numeric values.

@smcv
Contributor

smcv commented Jan 13, 2022

> So, libseccomp causes close_range to error with EPERM

My understanding is that this is completely under the caller's control, so it's the caller (Docker) that has decided to use EPERM. This was a long-standing design issue in Docker: the Linux kernel ABI is that syscalls unknown to the system should fail with ENOSYS, allowing callers to fall back to an older interface that the system does understand.

@cgwalters
Member

I think this is fixed in recent docker. It was definitely fixed in recent podman.

@dbnicholson
Member

@smcv Good point. FWIW, the default docker seccomp profile is here and documented here. The default profile still returns EPERM as defined by "defaultAction": "SCMP_ACT_ERRNO" and "defaultErrnoRet": 1.

I tried changing that to "defaultErrnoRet": 38 to get ENOSYS, but it didn't seem to help. Unfortunately, I think I've sunk too much time into this already to go further.
