Bug report

Hello,

I'm trying to download some fastq files from SRA using SRA-Toolkit and place them into Azure storage. However, it appears that the working directories of jobs are not being cleared after the jobs finish.
Expected behavior and actual behavior
With SRA-Toolkit, the .sra file for each SRA ID is first prefetched and then used to dump the fastq files for that ID. My pipeline script has a step that deletes the directory containing the .sra and other intermediate files, but because the fastq files are copied to Azure storage by the process itself, my script does not remove them afterwards; I assumed the working directories on the nodes would be cleared once the jobs complete.
However, when downloading fastqs from a manifest of 100 or more SRA IDs, the pipeline periodically shuts down with an error that the fastq file for a given ID cannot be found (to move to Azure storage), and Azure marks nodes as unusable when this happens. If I revise my manifest to exclude the SRA IDs I have already pulled successfully and rerun the pipeline, it runs just fine. That leads me to believe the fastqs are not being deleted: they pile up and consume local storage until new fastqs can no longer be downloaded, and therefore cannot be found when the process tries to move them to Azure. This occurs even when I use a Standard_D32_v3 VM type and have the cleanup setting set to true. Rerunning the pipeline with an updated manifest eventually gets the job done, but it is inconvenient and inefficient, so I am wondering if I am missing something or if this is some sort of bug. Thanks!
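For reference, the relevant parts of my nextflow.config look roughly like the sketch below. It is simplified: credentials are blanked, and the pool name and vmCount are placeholders rather than my exact values; the full file is in the attached tar.gz.

// Simplified sketch of the relevant nextflow.config settings (credentials
// blanked; pool name and vmCount are placeholders).
cleanup = true                         // expected to clear work directories after the run

workDir = 'az://work'                  // work directory in Azure Blob storage

azure {
    storage {
        accountName = ''               // blanked
        accountKey  = ''               // blanked
    }
    batch {
        location     = ''              // blanked
        accountName  = ''              // blanked
        accountKey   = ''              // blanked
        autoPoolMode = true
        pools {
            auto {
                vmType  = 'Standard_D32_v3'   // D2 in the attached example so it fails faster
                vmCount = 1                   // placeholder
            }
        }
    }
}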
Steps to reproduce the problem
I've included a tar.gz example pipeline containing my workflows, processes, script, and nextflow.config. Any personal keys or information have been removed from the config and replaced with blank strings. Note that I've changed the VM type to a D2 in this example pipeline to make it fail faster. I've also included a sample manifest, though I'm currently having trouble reproducing the exact same error as before.
It's worth noting that this tarred pipeline does not pull SRA files that require an NGC key, whereas I originally ran into the issue when pulling IDs that do require a key. But as mentioned, restarting the pipeline every time it failed led to gradual success, so I don't think the key is why the nodes reach maximum capacity. At the very least, my config and workflow should show whether I'm doing anything that could cause the working directories to fill up.
example_manifest.txt
example_sra_call_fail.tar.gz

Command to run pipeline:

nextflow run example_sra_call_fail -profile az --manifest example_manifest.txt --output_folder az://test
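For context, the download process looks roughly like the sketch below. It is simplified: the real version uses a template script (and, in my original runs, prefetch --ngc for keyed IDs), and the input and publishDir details here are placeholders; the full code is in the attached tar.gz. The process and sub-workflow names (sra_pull:sra_pull_process) match the error output further down.

// Rough sketch of the download process (simplified; real code in the tar.gz).
// Each task prefetches the .sra for one ID, dumps and gzips the fastqs, and
// publishes the .gz files to Azure storage.
process sra_pull_process {
    publishDir params.output_folder, mode: 'copy'   // e.g. az://test

    input:
    val sra_id

    output:
    path '*.gz'

    script:
    """
    prefetch ${sra_id}
    fastq-dump --split-files ${sra_id}
    gzip -f *.fastq
    rm -rf ${sra_id}   # removes the prefetch directory; the .fastq.gz files stay in the work dir
    """
}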
Program output
The following is the error I would get when my original pipeline broke after reaching maximum capacity and could no longer download new fastq files to send to Azure storage (I wasn't able to format this as a code block):
Error executing process > 'sra_pull:sra_pull_process (53)'
Caused by:
Missing output file(s) `*.gz` expected by process `sra_pull:sra_pull_process (53)`
Command executed [/home/ljl/sra_call/templates/sra_pull.sh]:
#!/usr/bin/env bash
echo "Pulling sra file with following SRA accession from dbGaP database: " SRR1312784
prefetch --ngc sra_key.ngc SRR1312784
echo "Finished downloading sra file with following accession: " SRR1312784
echo "Output file(s):"
ls SRR1312784
fastq-dump --split-files SRR1312784
gzip -f *.fastq
rm -rf SRR1312784
Command exit status:
0
Command output:
Pulling sra file with following SRA accession from dbGaP database: SRR1312784
Finished downloading sra file with following accession: SRR1312784
Output file(s):
Command error:
2023-04-05T00:03:15 prefetch.3.0.3: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
ls: SRR1312784: No such file or directory
Failed to call external services.
gzip: *.fastq: No such file or directory
Work dir:
az://work/9b/0cb2804f7287e49aa04199b1bc0683
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
Environment
Nextflow version: 22.04.5 build 5708
Java version: openjdk version "11.0.18"
Operating system: WSL Linux
Bash version: 5.0.17(1)-release
jstnchou changed the title from "Work directories not being cleared after jobs complete, resulting in maxing out on vm storage before pipeline completes" to "Work directories not clearing after jobs complete, causing maxed out vm storage before pipeline completes" on Apr 5, 2023.
Automatic cleanup is a popular topic around here. Welcome to the club.
The cleanup option currently doesn't work correctly with cloud storage because of a bug, but there is a fix in review: #3836.
There is also a discussion about making the cleanup option delete these files during the pipeline execution rather than at the end (#452), and a PR in progress for that as well (#3849).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.