
Non-deterministic failure of test_kraken_multi teardown #839

Closed · dpark01 opened this issue Jul 6, 2018 · 15 comments · Fixed by #841

@dpark01 (Member) commented Jul 6, 2018

Travis sometimes fails and sometimes succeeds on this particular part of the test suite, probably due to a race condition as xdist parallelizes the tests.

Mysteries include:

  • Why is the existence of libgkl_compression2251932993111688424.so non-deterministic?
  • Why is there a weird number in that filename?
  • Why is os.unlink trying to use a compression library to delete files?
  • Will this problem go away by itself? If so, will that make us feel better or worse about it?

_______ ERROR at teardown of test_kraken_multi[TestMetagenomicsViralMix] _______
[gw1] linux -- Python 3.6.5 /home/travis/build/broadinstitute/viral-ngs/tools/conda-tools/default/bin/python
def reset():
>       shutil.rmtree(new_tempdir)
conftest.py:64: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tools/conda-tools/default/lib/python3.6/shutil.py:480: in rmtree
    _rmtree_safe_fd(fd, path, onerror)
tools/conda-tools/default/lib/python3.6/shutil.py:438: in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
topfd = 11, path = '/tmp/pytest-of-travis/pytest-0/popen-gw1/test-function0'
onerror = <function rmtree.<locals>.onerror at 0x7f004cc5f1e0>
    def _rmtree_safe_fd(topfd, path, onerror):
        names = []
        try:
            names = os.listdir(topfd)
        except OSError as err:
            err.filename = path
            onerror(os.listdir, path, sys.exc_info())
        for name in names:
            fullname = os.path.join(path, name)
            try:
                orig_st = os.stat(name, dir_fd=topfd, follow_symlinks=False)
                mode = orig_st.st_mode
            except OSError:
                mode = 0
            if stat.S_ISDIR(mode):
                try:
                    dirfd = os.open(name, os.O_RDONLY, dir_fd=topfd)
                except OSError:
                    onerror(os.open, fullname, sys.exc_info())
                else:
                    try:
                        if os.path.samestat(orig_st, os.fstat(dirfd)):
                            _rmtree_safe_fd(dirfd, fullname, onerror)
                            try:
                                os.rmdir(name, dir_fd=topfd)
                            except OSError:
                                onerror(os.rmdir, fullname, sys.exc_info())
                        else:
                            try:
                                # This can only happen if someone replaces
                                # a directory with a symlink after the call to
                                # stat.S_ISDIR above.
                                raise OSError("Cannot call rmtree on a symbolic "
                                              "link")
                            except OSError:
                                onerror(os.path.islink, fullname, sys.exc_info())
                    finally:
                        os.close(dirfd)
            else:
                try:
>                   os.unlink(name, dir_fd=topfd)
E                   FileNotFoundError: [Errno 2] No such file or directory: 'libgkl_compression2251932993111688424.so'
tools/conda-tools/default/lib/python3.6/shutil.py:436: FileNotFoundError
@notestaff (Contributor) commented:

I've seen this, have fixes, will check in soon.

@tomkinsc (Member) commented Jul 6, 2018

Is it a race condition with multiple threads trying to remove the same shared object file? I'm guessing this call is to blame:
https://github.com/broadinstitute/viral-ngs/blob/master/tools/kraken.py#L168

@dpark01 (Member, Author) commented Jul 6, 2018

Oh... maybe the Picard pipes aren't being closed properly before the end of the method, and the test fixture blows the tempdir away before Picard is done?

@notestaff (Contributor) commented:

I think it has to do with the tmpdir_function and tmpdir_module fixtures in conftest.py. Reading the docs of tmpdir_factory, that way of creating tempdirs is less robust than using mkdtemp. Also, the rmtree calls don't check whether the tree still exists before trying to remove it.
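
A minimal sketch of what that rewrite might look like, assuming a function-scoped fixture in conftest.py (the fixture name and scope here are illustrative, not the actual viral-ngs code):

```python
# Hypothetical sketch only: a per-test tempdir fixture built on
# tempfile.mkdtemp rather than pytest's tmpdir_factory, with a guarded
# teardown. Fixture name and scope are assumptions, not viral-ngs code.
import os
import shutil
import tempfile

import pytest

@pytest.fixture
def tmpdir_function(request):
    new_tempdir = tempfile.mkdtemp(prefix='test-{}-'.format(request.node.name))
    yield new_tempdir
    # Check that the tree still exists before trying to remove it.
    if os.path.isdir(new_tempdir):
        shutil.rmtree(new_tempdir)
```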

@dpark01 (Member, Author) commented Jul 6, 2018

Maybe we should add a .poll() on each of the fastq_pipes after this line:
https://github.com/broadinstitute/viral-ngs/blob/master/tools/kraken.py#L173

@notestaff (Contributor) commented:

Maybe there are two separate issues here :)

@dpark01 (Member, Author) commented Jul 6, 2018

Yes, it could be all of the above. The reason I think @tomkinsc's highlighted line of code might be at play is that if you google libgkl_compression.so, everything that comes up is Picard/GATK-related. So I suspect the shutil.rmtree is running concurrently with Picard still using files in the directory tree being blown away.

@notestaff (Contributor) commented:

The simplest fix is to just check whether the tree exists before shutil.rmtree(new_tempdir), but I think it's better to rewrite this to use mkdtemp.

@dpark01 (Member, Author) commented Jul 6, 2018

I bet both bits of code have issues that only present as noticeable problems in combination.

I like using mkdtemp, but I think it's still good to rmtree at the end of a test fixture: as the test suite grows, you run out of local tmp space if you don't clean up after big tests. Can't we just run shutil.rmtree(new_tempdir, ignore_errors=True) (the equivalent of rm -rf)?
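
For reference, against the reset() teardown shown in the traceback above, that change would just be:

```python
# The reset() teardown from conftest.py with ignore_errors=True, which
# gives rm -rf semantics: files that vanish mid-walk are silently ignored.
def reset():
    shutil.rmtree(new_tempdir, ignore_errors=True)
```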

BTW, I'm not sure whether the problem is that the directory doesn't exist, or that a file that was in the directory tree when rmtree started walking it (a Picard temp file) disappeared before rmtree got around to unlinking it. So rm -rf semantics would make the test succeed, but it's probably still a good idea to make sure kraken doesn't return from its method invocation before all its pipes are closed out and cleaned up.

@notestaff (Contributor) commented:

Maybe util.file.fifo() should poll the pipes after the yield, before unlinking them?

@notestaff (Contributor) commented:

And/or add a small time delay before unlinking the pipes and returning.
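
A sketch of that idea, assuming util.file.fifo is a context manager roughly along these lines (the real viral-ngs implementation may differ):

```python
# Hypothetical sketch of util.file.fifo with a short grace period before
# unlinking, per the suggestion above. Not the actual viral-ngs code.
import contextlib
import os
import tempfile
import time

@contextlib.contextmanager
def fifo(name='fifo'):
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, name)
    os.mkfifo(path)
    try:
        yield path
    finally:
        # This process never opens the fifo itself (only the filename is
        # handed to subprocesses), so there is nothing to poll or close
        # here; a small delay gives readers/writers time to finish.
        time.sleep(0.1)
        os.unlink(path)
        os.rmdir(tmpdir)
```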

@dpark01 (Member, Author) commented Jul 6, 2018

Actually, I just realized that the kraken code is using util.file.fifo, which makes things a bit tricky. These aren't really pipes; they're named pipes (FIFOs) that behave like files. There's no way to poll or close them manually that I know of: our Python code never opened them, we just passed the strings (filenames) to other programs that handled all that.

The Picard side of the pipe plays happily with standard Python pipe-fitting (it can interleave the fastqs and is often used that way to pipe to bwa). It's the kraken side that has issues. Maybe we can revert the code here to use temp files again instead of fifos until we eventually tackle the larger question of how to clean up the pipe-fitting while also adding unpaired read support (#820). Then again, I'm not quite sure why reverting to temp files would solve the race condition... hm.

Or time delay...

@notestaff (Contributor) commented:

Adding a 2s delay at the end of metagenomics.diamond() does fix one transient bug in my experiments on AWS.

@dpark01 (Member, Author) commented Jul 6, 2018

Wow, 2s is much longer than I would've tried (maybe 100ms)...

Oh wait! We're calling picard.execute(background=True) but ignoring the return value. That return value is a Popen object, which we could use to do the proper if picard_ps.poll(): raise subprocess.CalledProcessError(...) dance. The fact that we're ignoring it means there's probably nothing actually forcing the process to close properly...
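
Something like this, assuming picard.execute(background=True) returns the Popen object (the exact viral-ngs signature is a guess here):

```python
# Hypothetical sketch: keep the Popen that picard.execute(background=True)
# returns and block until Picard has actually exited, raising on failure,
# so the method can't return while Picard is still touching the tempdir.
import subprocess

def run_picard_to_completion(picard, args):
    picard_ps = picard.execute(args, background=True)  # previously discarded
    # ... consume/produce on the fifos here ...
    returncode = picard_ps.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, 'picard')
```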

@yesimon (Contributor) commented Jul 9, 2018

I've had nondeterministic test failures in the past, but never this specific one. I think it's highly likely to be related to some combination of xdist, temp directory removal, and background processes. I don't think the named fifos should be causing problems, because they are essentially real files to the program as long as it doesn't try to seek or tell.

I've tried in the past to add ps.wait() calls to finish up all the background processes, but it didn't seem to have any effect. Regardless, adding them should not cause any harm.

Mainly, gluing together the std streams of subprocesses through Python seemed to cause the most issues, so that's something I try to avoid by using named fifos instead.
