Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SRA_fetch workflow & fastq-dl task improvements #150

Merged
merged 6 commits into from
Aug 23, 2023

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented Aug 10, 2023

🛠️ Changes Being Made

tasks/utilities/task_sra_fetch.wdl

  • upgraded to v2.0.4 and default docker is now us-docker.pkg.dev/general-theiagen/biocontainers/fastq-dl:2.0.4--pyhdfd78af_0
  • set default for String input fastq_dl_opts to "--provider sra --only-provider" so that the default of the workflow is to only download from SRA. If the user wants to use ENA instead, they can use "" or "--provider ena" to revert.
  • added meta block and set volatile to true so that call caching is always off.
  • broke fastq-dl cmd into multiple lines
  • added --cpus option to cmd
  • added --verbose option to cmd so it's always chatty 🗣️
  • added new File output fastq_metadata which is the fastq-run-info.tsv produced by fastq-dl that contains metadata about the Run that was downloaded.
  • added new String output for capturing fastq-dl version used
  • added new String output for capturing docker image used
  • added new String output fastq_dl_date to capture when files were downloaded

workflows/utilities/data_import/wf_sra_fetch.wdl

  • added 4 new required outputs (4 new output columns in Terra):
    File fastq_dl_fastq_metadata = fastq_dl_sra.fastq_metadata
    String fastq_dl_version = fastq_dl_sra.fastq_dl_version
    String fastq_dl_docker = fastq_dl_sra.fastq_dl_docker
    String fastq_dl_date = fastq_dl_sra.fastq_dl_date

🧠 Context and Rationale

The main motivation for this PR is because of a somewhat-rare issue/bug when using fastq-dl's default provider, ENA, to download FASTQ files.

ENA regularly "syncs" data with NCBI SRA (and DDBJ), turns out sometimes (not sure how frequently) they have started syncing SRA Lite formatted FASTQ files (all Qscores=Q30) instead of the original FASTQ files with all original Qscores (AKA SRA Normalized format)

It doesn't happen for all SRR accessions, just the accessions that ENA has synced in SRA Lite format.

So the new default should pull directly from SRA, in SRA Normalized format (where available). I have only run into 1 SRR where the original FASTQ files were unavailable so you can only download the SRA Lite formatted FASTQ files. For these rare occurrences, the SRA helpdesk should be contacted.

📋 Workflow/Task Steps

Inputs

Outputs

see above as well as the WDL files

🧪 Testing

Locally

Tested locally with miniwdl

Terra

Testing now with 56 SRR/ERR/DRR accessions here: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/6a357d50-67bd-451b-8845-9be16cae17c6

I also plan to run TheiaProk_Illumina_PE to verify that Qscores averages are not exactly Q30 to show that these FASTQs are in SRA Normalized format (original Qscores).

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

…e to only download from SRA (instead of ENA); capture date and output as string; fastq-dl is always verbose; added --cpus option to cmd; added string outputs for fastq-dl version, docker image, and date it was run; added maxRetries to runtime block.
@kapsakcj
Copy link
Contributor Author

Warning, it seems that there is a tiny bug with fastq-dl v2.0.3 where it prints v2.0.2 incorrectly:

$ docker run quay.io/biocontainers/fastq-dl:2.0.3--pyhdfd78af_0 fastq-dl --version
fastq-dl, version 2.0.2

Robert said he will fix in the next release but wanted to note that here since we're setting this as the new default

@kapsakcj kapsakcj marked this pull request as ready for review August 16, 2023 16:42
@kevinlibuit
Copy link
Contributor

Thanks, @kapsakcj! Looks really straight forward and I can launch a few test runs on Terra side. Also tagging @rpetit3 for review for peace of mind RE use of fastq-dl!

@kevinlibuit
Copy link
Contributor

Terra
Testing now with 56 SRR/ERR/DRR accessions here: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/6a357d50-67bd-451b-8845-9be16cae17c6

Are any of these SRA-Lite in ENA?

@kapsakcj
Copy link
Contributor Author

Are any of these SRA-Lite in ENA?

Hmm I'm not sure about ENA but there were some that were SRA Lite format even when downloaded directly from NCBI 😢 Not much we can do about those other than report to NCBI

The 2 SRRs that spurred this whole thing are listed here: rpetit3/fastq-dl#23

@rpetit3
Copy link
Contributor

rpetit3 commented Aug 21, 2023

@kapsakcj is correct, not much we can do when all SRA has made is available is the SRA Lite format. I think samples originating from ENA should be OK, unless for some reason SRA only makes the Lite version available.

Only issue you might run into with the defaults, is when a sample is available from ENA, but hasn't been synced to SRA yet.

@kapsakcj
Copy link
Contributor Author

Only issue you might run into with the defaults, is when a sample is available from ENA, but hasn't been synced to SRA yet.

Ah, I did not think about this. I'll update the default to remove --only-provider

Then I'll re-run my test set in Terra

@kapsakcj
Copy link
Contributor Author

Tested successfully in Terra after adjusting the default options for fastq-dl and adjusting output filename for the Run metadata TSV

https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/24fda19d-2c76-4b37-812b-040c6de4eccf

@kevinlibuit
Copy link
Contributor

Tested successfully in Terra after adjusting the default options for fastq-dl and adjusting output filename for the Run metadata TSV

https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/24fda19d-2c76-4b37-812b-040c6de4eccf

Nice! And threw one more test in for peace of mind: https://app.terra.bio/#workspaces/theiagen-validations/Libui-Sandbox_2023/job_history/f531db43-af72-4600-8f1e-fd61b2704561

@kevinlibuit kevinlibuit merged commit 4ee9b6d into main Aug 23, 2023
5 checks passed
@kapsakcj kapsakcj deleted the cjk-sra-fetch-defaults branch October 11, 2023 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants