GAP package sglppow causes stuck subprocess and fgets() failed with errno 5 #4136

Open
benlorenz opened this issue Sep 23, 2024 · 3 comments
Labels
bug Something isn't working

@benlorenz
Member

Describe the bug
As reported on Slack by @paemurru, since today exiting Julia prints a strange error message:

julia> exit()
fgets() failed with errno 5
Input/output error

It does not happen every time but quite often.

This seems to be caused by #3517: once I comment out this line

     "sglppow",  # extend the small groups library

the issue disappears.

This error message is printed not by the julia process but by a (stuck?) Singular subprocess that is started when Oscar is loaded:

julia> using Oscar
  ___   ____   ____    _    ____
 / _ \ / ___| / ___|  / \  |  _ \   |  Combining ANTIC, GAP, Polymake, Singular
| | | |\___ \| |     / _ \ | |_) |  |  Type "?Oscar" for more information
| |_| | ___) | |___ / ___ \|  _ <   |  Manual: https://docs.oscar-system.org
 \___/ |____/ \____/_/   \_\_| \_\  |  1.2.0-DEV #HEAD 55ea957 2024-09-22

shell> ps -f
UID        PID  PPID  C STIME TTY          TIME CMD
lorenz    4147  4146  0 Sep20 pts/18   00:00:00 bash
lorenz   22361  4147 26 17:55 pts/18   00:00:12 julia-latest --project=.
lorenz   22475 22361  0 17:55 pts/18   00:00:00 /usr/bin/singular -t /tmp/gaptempdirQdsGt8/sing.in
lorenz   22776 22361  0 17:55 pts/18   00:00:00 /bin/bash -c (ps -f) && true
lorenz   22777 22776  0 17:55 pts/18   00:00:00 ps -f

shell> cat /tmp/gaptempdirQdsGt8/sing.in
proc GAP_Done () { return ( "@" ) };
proc GAP_Apostrophe () { return ( "'" ) };
GAP_Done();

julia> exit()
fgets() failed with errno 5
Input/output error

I do not see this subprocess once I remove the sglppow package from Oscar.jl.

Attaching strace to the singular subprocess shows that this is indeed the source of this message:

...
17:46:35.462538 read(0, 0x563fb8d25810, 1024) = -1 EIO (Input/output error)
17:46:38.640469 write(2, "fgets() failed with errno 5\nInput/output error\n", 47) = 47
17:46:38.640626 write(1, "Auf Wiedersehen.\n", 17) = -1 EIO (Input/output error)
...

To Reproduce

using Oscar
exit()

System (please complete the following information):
Oscar from today, e.g. 55ea957 or 38f37bc.

You might need to have some version of Singular in your PATH for this to happen.

cc: @paemurru @ThomasBreuer

@benlorenz benlorenz added the bug Something isn't working label Sep 23, 2024
@ThomasBreuer
Member

I had observed the same problem, and mentioned it first in a comment for #3478 and then in oscar-system/GAP.jl/issues/971; the solution seemed to be to use GAP 4.13, which we do now.

@fingolfin
Member

On Linux one can easily reproduce this multiple times by executing GAP.Globals.CloseSingular() and GAP.Globals.StartSingular() repeatedly.

The singular GAP package starts a Singular interpreter (and PR #4132 is meant to ensure it will always find one, namely the one we bundle with Oscar).

That GAP package also does this:

BindGlobal( "CloseSingular", function (  )
    if IsStream( Sing_Proc )  then
        if not IsClosedStream( Sing_Proc )  then
            WriteLine( Sing_Proc, ";quit;" );
            CloseStream( Sing_Proc );
        else
            Info( InfoSingular, 2, "Singular already closed." );
        fi;
    fi;
    # after closing Singular, the names become out of date.
    SingularNamesThisRing := ShallowCopy( SingularNames );
end );


# Kill Singular when Gap terminates
InstallAtExit( CloseSingular );

So when we exit, it calls CloseSingular, which is supposed to end the Singular interpreter by sending it a quit; command. Unfortunately it then does not wait to give Singular time to actually shut down, and just closes the pipe the hard way.

In my tests, inserting a 10 millisecond delay between WriteLine and CloseStream fixed the issue. Of course that's still a hack; it really should be waiting for it to exit.
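
One way to "wait for it to exit" instead of sleeping for a fixed 10 ms is to poll the child's status until it is gone or a deadline passes. The following is a minimal C sketch of that idea only. It is not the GAP kernel's code and not necessarily the fix that was eventually applied; `wait_for_exit` and `spawn_sleeper` are hypothetical names introduced here for illustration, and the 1 ms polling step is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>     /* kill(), for callers that escalate after a timeout */
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not part of GAP): poll for the child's exit for
 * roughly timeout_ms milliseconds.
 * Returns  1 if the child exited and was reaped here,
 *          0 if it is still running when the timeout expires,
 *         -1 on error (e.g. ECHILD if the child was already reaped). */
static int wait_for_exit(pid_t pid, int timeout_ms)
{
    assert(timeout_ms >= 0);
    const struct timespec step = { 0, 1000000L };   /* 1 ms per poll */
    for (int waited = 0; ; waited++) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 1;                   /* child is gone */
        if (r == -1 && errno != EINTR)
            return -1;                  /* not our child (any more) */
        if (waited >= timeout_ms)
            return 0;                   /* still running, give up */
        nanosleep(&step, NULL);
    }
}

/* Demo helper (hypothetical): fork a child that sleeps for the given
 * number of seconds and then exits. Returns the child's PID in the parent. */
static pid_t spawn_sleeper(unsigned seconds)
{
    pid_t pid = fork();
    if (pid == 0) {
        sleep(seconds);
        _exit(0);
    }
    return pid;
}
```

A caller could write the quit command, call `wait_for_exit(pid, 1000)`, and fall back to `kill` only when it returns 0, so a well-behaved child is never cut off mid-shutdown.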

Indeed the GAP kernel code behind CloseStream should take care of that. It was incorrectly closing the pipe before killing the child. I've fixed that in GAP 4.12. But looking at that code again, I see that it also does a waitpid -- but after closing the pipe. Which seems fishy?

static Obj FuncCLOSE_PTY_IOSTREAM(Obj self, Obj stream)
{
    UInt pty = HashLockStreamIfAvailable(stream);

    // Close down the child
    int status;
    kill(PtyIOStreams[pty].childPID, SIGTERM);
    int retcode = close(PtyIOStreams[pty].ptyFD);
    if (retcode)
        Pr("Strange close return code %d\n", retcode, 0);
    // GAP (or another library) might wait on this PID before
    // we handle it. If that happens, waitpid will return -1.
    retcode = waitpid(PtyIOStreams[pty].childPID, &status, WNOHANG);
    if (retcode == 0) {
        // Give process a second to quit
        sleep(1);
        retcode = waitpid(PtyIOStreams[pty].childPID, &status, WNOHANG);
    }
    if (retcode == 0) {
        // Hard kill process
        kill(PtyIOStreams[pty].childPID, SIGKILL);
        retcode = waitpid(PtyIOStreams[pty].childPID, &status, 0);
    }

    PtyIOStreams[pty].inuse = 0;

    FreeStream(pty);
    HashUnlock(PtyIOStreams);
    return 0;
}

I gotta read up on POSIX semantics to figure out what we really should be doing there (I didn't write that code).
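
For reference, the `retcode` cases that the function above distinguishes correspond to the documented return values of `waitpid` with `WNOHANG`: 0 while the child is still running, the child's PID once it has been reaped by this call, and -1 (with `errno` set to `ECHILD`) when the child was already reaped elsewhere. A small C sketch of that classification follows; `probe_child` and the `child_state` enum are hypothetical names used only for illustration and are not part of GAP.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum child_state {
    CHILD_RUNNING,   /* waitpid returned 0: child has not exited yet      */
    CHILD_REAPED,    /* waitpid returned pid: exit status collected here  */
    CHILD_NOT_OURS,  /* waitpid returned -1 with ECHILD: already reaped   */
    CHILD_ERROR      /* waitpid returned -1 for some other reason         */
};

/* Hypothetical helper: a single non-blocking status check on one child. */
static enum child_state probe_child(pid_t pid)
{
    /* pid must name one specific child; 0 and -1 mean something else
     * to waitpid (process group / any child). */
    assert(pid > 0);

    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == 0)
        return CHILD_RUNNING;
    if (r == pid)
        return CHILD_REAPED;
    if (errno == ECHILD)
        return CHILD_NOT_OURS;
    return CHILD_ERROR;
}
```

In the quoted kernel function, only the `CHILD_RUNNING` case (return value 0) triggers the one-second sleep and the later `SIGKILL`; both other outcomes fall through to the cleanup at the end.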

@fingolfin
Member

(To clarify: I wrote "pipe" above, but GAP doesn't really open a pipe, it uses pseudo ttys instead)

fingolfin added a commit to fingolfin/Yggdrasil that referenced this issue Sep 25, 2024
... to avoid hang when calling StopSingular(). See also
<oscar-system/Oscar.jl#4136>
fingolfin added a commit to JuliaPackaging/Yggdrasil that referenced this issue Sep 25, 2024
... to avoid hang when calling StopSingular(). See also
<oscar-system/Oscar.jl#4136>