Bug report

Hello,

I'm trying to download some fastq files from SRA using SRA-Toolkit and place them into Azure storage. However, it appears that the working directories of jobs are not being cleared after the jobs finish.
Expected behavior and actual behavior
With SRA-Toolkit, the .sra file for each SRA ID is first prefetched and then used to dump the fastq files for that ID. My pipeline script has a step that deletes the directory containing the .sra and other intermediate files, but because the fastq files are copied to Azure storage by the process itself, my script does not remove them afterwards; I assumed the working directories on the nodes would be cleared once the jobs complete.
However, when downloading fastqs from a manifest of 100 or more SRA IDs, the pipeline periodically shuts down with an error that the fastq file for a given ID cannot be found (to move to Azure storage), and Azure marks nodes as unusable when this happens. If I revise my manifest to exclude the SRA IDs I have already pulled successfully and rerun the pipeline, it runs just fine. That leads me to believe the fastqs are not being deleted: they pile up and consume local storage until new fastqs can no longer be downloaded, and therefore cannot be found when the process tries to move them to Azure. This occurs even when I use a Standard_D32_v3 VM type and have the cleanup setting set to true. Rerunning the pipeline with an updated manifest eventually gets the job done, but it is inconvenient and inefficient, so I am wondering if I am missing something or if this is some sort of bug. Thanks!
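For reference, the relevant parts of my nextflow.config look roughly like the sketch below. It is simplified: credentials are blanked, and the pool name and vmCount are placeholders rather than my exact values; the full file is in the attached tar.gz.

// Simplified sketch of the relevant nextflow.config settings (credentials
// blanked; pool name and vmCount are placeholders).
cleanup = true                         // expected to clear work directories after the run

workDir = 'az://work'                  // work directory in Azure Blob storage

azure {
    storage {
        accountName = ''               // blanked
        accountKey  = ''               // blanked
    }
    batch {
        location     = ''              // blanked
        accountName  = ''              // blanked
        accountKey   = ''              // blanked
        autoPoolMode = true
        pools {
            auto {
                vmType  = 'Standard_D32_v3'   // D2 in the attached example so it fails faster
                vmCount = 1                   // placeholder
            }
        }
    }
}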
Steps to reproduce the problem
I've included a tar.gz example pipeline containing my workflows, processes, script, and nextflow.config. Any personal keys or information have been removed from the config and replaced with blank strings. Note that I've changed the VM type to a D2 in this example pipeline to make it fail faster. I've also included a sample manifest, though I'm currently having trouble reproducing the exact same error as before.
It's worth noting that this tarred pipeline does not pull SRA files that require an NGC key, whereas I originally ran into the issue when pulling IDs that do require a key. But as mentioned, restarting the pipeline every time it failed led to gradual success, so I don't think the key is why the nodes reach maximum capacity. At the very least, my config and workflow should show whether I'm doing anything that could cause the working directories to fill up.
example_manifest.txt
example_sra_call_fail.tar.gz

Command to run pipeline:

nextflow run example_sra_call_fail -profile az --manifest example_manifest.txt --output_folder az://test
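For context, the download process looks roughly like the sketch below. It is simplified: the real version uses a template script (and, in my original runs, prefetch --ngc for keyed IDs), and the input and publishDir details here are placeholders; the full code is in the attached tar.gz. The process and sub-workflow names (sra_pull:sra_pull_process) match the error output further down.

// Rough sketch of the download process (simplified; real code in the tar.gz).
// Each task prefetches the .sra for one ID, dumps and gzips the fastqs, and
// publishes the .gz files to Azure storage.
process sra_pull_process {
    publishDir params.output_folder, mode: 'copy'   // e.g. az://test

    input:
    val sra_id

    output:
    path '*.gz'

    script:
    """
    prefetch ${sra_id}
    fastq-dump --split-files ${sra_id}
    gzip -f *.fastq
    rm -rf ${sra_id}   # removes the prefetch directory; the .fastq.gz files stay in the work dir
    """
}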
Program output
The following is the error I would get when my original pipeline broke after reaching maximum capacity and could no longer download new fastq files to send to Azure storage (I wasn't able to format this as a code block):
Error executing process > 'sra_pull:sra_pull_process (53)'
Caused by:
Missing output file(s) `*.gz` expected by process `sra_pull:sra_pull_process (53)`
Command executed [/home/ljl/sra_call/templates/sra_pull.sh]:
#!/usr/bin/env bash
echo "Pulling sra file with following SRA accession from dbGaP database: " SRR1312784
prefetch --ngc sra_key.ngc SRR1312784
echo "Finished downloading sra file with following accession: " SRR1312784
echo "Output file(s):"
ls SRR1312784
fastq-dump --split-files SRR1312784
gzip -f *.fastq
rm -rf SRR1312784
Command exit status:
0
Command output:
Pulling sra file with following SRA accession from dbGaP database: SRR1312784
Finished downloading sra file with following accession: SRR1312784
Output file(s):
Command error:
2023-04-05T00:03:15 prefetch.3.0.3: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
ls: SRR1312784: No such file or directory
Failed to call external services.
gzip: *.fastq: No such file or directory
Work dir:
az://work/9b/0cb2804f7287e49aa04199b1bc0683
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Environment
Nextflow version: 22.04.5 build 5708
Java version: openjdk version "11.0.18"
Operating system: WSL Linux
Bash version: 5.0.17(1)-release
jstnchou changed the title from "Work directories not being cleared after jobs complete, resulting in maxing out on vm storage before pipeline completes" to "Work directories not clearing after jobs complete, causing maxed out vm storage before pipeline completes" on Apr 5, 2023.
Automatic cleanup is a popular topic around here. Welcome to the club.
The cleanup option currently doesn't work correctly with cloud storage because of a bug, but there is a fix in review: #3836.
There is also a discussion about making the cleanup option delete these files during the pipeline execution rather than at the end (#452), and a PR in progress for that as well (#3849).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.