SRA_fetch workflow & fastq-dl task improvements #150
…e to only download from SRA (instead of ENA); capture date and output as string; fastq-dl is always verbose; added --cpus option to cmd; added string outputs for fastq-dl version, docker image, and date it was run; added maxRetries to runtime block.
Warning: there seems to be a tiny bug in the 2.0.3 container, which reports its version as 2.0.2:
$ docker run quay.io/biocontainers/fastq-dl:2.0.3--pyhdfd78af_0 fastq-dl --version
fastq-dl, version 2.0.2
Robert said he will fix this in the next release, but I wanted to note it here since we're setting this as the new default.
Are any of these SRA-Lite in ENA?
Hmm, I'm not sure about ENA, but there were some that were in SRA Lite format even when downloaded directly from NCBI 😢 Not much we can do about those other than report to NCBI. The 2 SRRs that spurred this whole thing are listed here: rpetit3/fastq-dl#23
@kapsakcj is correct, there is not much we can do when all SRA has made available is the SRA Lite format. I think samples originating from ENA should be OK, unless for some reason SRA only makes the Lite version available. The only issue you might run into with the defaults is when a sample is available from ENA but hasn't been synced to SRA yet.
Ah, I did not think about this. I'll update the default to remove Then I'll re-run my test set in Terra
Tested successfully in Terra after adjusting the default options for fastq-dl and adjusting the output filename for the Run metadata TSV
Nice! And threw one more test in for peace of mind: https://app.terra.bio/#workspaces/theiagen-validations/Libui-Sandbox_2023/job_history/f531db43-af72-4600-8f1e-fd61b2704561
🛠️ Changes Being Made
tasks/utilities/task_sra_fetch.wdl
- Default docker image set to `us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0`
- Default for `fastq_dl_opts` set to `"--provider sra --only-provider"` so that the default of the workflow is to only download from SRA. If the user wants to use ENA instead, they can use `""` or `"--provider ena"` to revert.
- Added `meta` block and set `volatile` to true so that call caching is always off.
- Split `fastq-dl` cmd into multiple lines
- Added `--cpus` option to cmd
- Added `--verbose` option to cmd so it's always chatty 🗣️
- Added output `fastq_metadata`, which is the `fastq-run-info.tsv` produced by fastq-dl that contains metadata about the Run that was downloaded.
- Added output for the `fastq-dl` version used
- Added output `fastq_dl_date` to capture when files were downloaded
workflows/utilities/data_import/wf_sra_fetch.wdl
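To illustrate how the `fastq_dl_opts` default and its overrides change the command that gets run, here is a minimal Python sketch. It is not the WDL task itself: the function name `build_fastq_dl_cmd`, its parameters, and the default of 2 CPUs are hypothetical, and the `--accession` flag is assumed from fastq-dl's command-line interface rather than taken from this PR.

```python
import shlex

# Default option string from the PR description.
DEFAULT_FASTQ_DL_OPTS = "--provider sra --only-provider"


def build_fastq_dl_cmd(accession, fastq_dl_opts=DEFAULT_FASTQ_DL_OPTS, cpus=2):
    """Return the argv list for one fastq-dl call (hypothetical helper).

    --verbose and --cpus are always present, mirroring the changes listed
    above; the free-form option string is appended at the end.
    """
    cmd = ["fastq-dl", "--verbose", "--accession", accession, "--cpus", str(cpus)]
    # shlex.split("") returns [], so an empty string falls back to
    # fastq-dl's own default provider.
    cmd += shlex.split(fastq_dl_opts)
    return cmd
```

Passing `fastq_dl_opts=""` or `fastq_dl_opts="--provider ena"` corresponds to the two ways of reverting to ENA mentioned above.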
🧠 Context and Rationale
The main motivation for this PR is a somewhat rare issue/bug when using fastq-dl's default provider, ENA, to download FASTQ files.
ENA regularly "syncs" data with NCBI SRA (and DDBJ). It turns out that sometimes (not sure how frequently) ENA syncs SRA Lite formatted FASTQ files (all Qscores=Q30) instead of the original FASTQ files with all original Qscores (AKA SRA Normalized format).
It doesn't happen for all SRR accessions, just the accessions that ENA has synced in SRA Lite format.
So the new default should pull directly from SRA, in SRA Normalized format (where available). I have only run into 1 SRR where the original FASTQ files were unavailable so you can only download the SRA Lite formatted FASTQ files. For these rare occurrences, the SRA helpdesk should be contacted.
📋 Workflow/Task Steps
Inputs
Outputs
see above as well as the WDL files
🧪 Testing
Locally
Tested locally with miniwdl
Terra
Testing now with 56 SRR/ERR/DRR accessions here: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/6a357d50-67bd-451b-8845-9be16cae17c6
I also plan to run TheiaProk_Illumina_PE to verify that Qscores averages are not exactly Q30 to show that these FASTQs are in SRA Normalized format (original Qscores).
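As a lighter-weight complement to a full TheiaProk_Illumina_PE run, the Qscore check can be sketched directly in Python. This is an illustrative sketch, not part of the workflow: the function name `quality_summary` is hypothetical, and it assumes standard 4-line FASTQ records with Phred+33 encoding, where Q30 is the character `?`.

```python
def quality_summary(handle, offset=33, flat_q=30):
    """Return (mean_q, all_flat) for a FASTQ text stream.

    mean_q is the mean Phred score over all bases; all_flat is True when
    every base has exactly flat_q, which is the pattern described above
    for SRA Lite files. Assumes unwrapped 4-line FASTQ records.
    """
    total = 0
    n_bases = 0
    all_flat = True
    for i, line in enumerate(handle):
        if i % 4 != 3:  # only the 4th line of each record holds qualities
            continue
        for char in line.rstrip("\r\n"):
            q = ord(char) - offset
            total += q
            n_bases += 1
            if q != flat_q:
                all_flat = False
    if n_bases == 0:
        return float("nan"), False
    return total / n_bases, all_flat
```

For a downloaded file this could be called as `quality_summary(gzip.open(path, "rt"))` after `import gzip`. A result with `all_flat` set to True suggests the file is in SRA Lite format rather than SRA Normalized format.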
🔬 Quality checks
Pull Request (PR) checklist: