Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lyveset fastq file parsing bugfix and other improvements #156

Merged
merged 6 commits into from
Aug 30, 2023

Conversation

kapsakcj
Copy link
Contributor

@kapsakcj kapsakcj commented Aug 17, 2023

Setting as a draft for now, need to do further testing in Terra prior to marking ready for review

🛠️ Changes Being Made

tasks/phylogenetic_inference/task_lyveset.wdl changes

  • removed unnecessary ? from read_cleaner input declaration. Also switched to double quotes from single quotes
  • fixed variable for checking read1 and read2 array lengths
  • IMPORTANT CHANGE: added bash logic for renaming FASTQs that end in _R{1,2}.fastq.gz to end in _{1,2}.fastq.gz. See WDL for full description
  • altered how FASTQs are copied or moved into the required input directory for launching lyveSET.
  • IMPORTANT CHANGE: fixed optional flags for masking phages, masking cliffs, and sample sites.
  • fixed optional output Files as the paths were incorrect

🧠 Context and Rationale

The lyveSET workflow, more specifically shuffleSplitReads.pl script for merging R1 and R2 FASTQ files is unable to parse FASTQ files that end in _R1.fastq.gz or _R2.fastq.gz

So any FASTQ file brought into Terra with basespace_fetch or sra_fetch workflows would not be able to be used as input for lyveset.

Along the way I made other improvements to the task.

📋 Workflow/Task Steps

Inputs

Outputs

🧪 Testing

Locally

Tested many times locally, was able to run successfully with miniwdl

Terra

Will populate this after testing is finished.

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

@michellescribner
Copy link
Contributor

@kapsakcj
Copy link
Contributor Author

@michellescribner Thanks for trying these out, seems like it's already showing an error on the step that shuffles the reads

@michellescribner
Copy link
Contributor

Dang it.

This is the error for v1.0.1 of Lyve_SET for the set:
shuffleSplitReads.pl: main::readFastqs: ERROR: trying to set Campylobacter_jejunii_1.2-BHQL12.2023-04-25.01-E01_S103_R1_001-FS10002482.fastq.gz as read number 1 for Campylobacter, but it already exists as Campylobacter_jejunii_1.1-BHQL12.2023-04-25.01-B02_S222_R1_001-FS10002485.fastq.gz
This is the error with this branch for the same set:
shuffleSplitReads.pl: main::readFastqs: ERROR: trying to set input-fastqs/Campylobacter_jejunii_1.2-BHQL12.2023-04-25.01-E01_S103_R1_001-FS10002482.fastq.gz as read number 1 for input-fastqs/Campylobacter, but it already exists as input-fastqs/Campylobacter_jejunii_1.1-BHQL12.2023-04-25.01-B02_S222_R1_001-FS10002485.fastq.gz

Issue seems to be persisting

@michellescribner
Copy link
Contributor

This branch doesn't resolve one of the issues with this set, which is that LyveSET appears to need there to be a unique identifier for the reads before the first underscore in the read file names. I ought to have realized this. @kapsakcj do you think that we could replace underscores with another character in the read file names (maybe with hyphens?) when you are moving the files?

@frankambrosio3
Copy link
Contributor

Hit an issue with Lyve_SET where samples fail successfully due to insufficient memory allocation to the virtual machine on which this job was running. The reason this run appeared successful yet failed to produce outputs is the Lyve_SET tool is well-written and had built-in error handling, which caused the workflow to complete successfully despite Lyve_SET failing to complete the analysis. This prevented the detection of a memory-related workflow failure, which resulted in Terra's memory-retry feature failing to engage.

We resolved the issue by using the optional memory and disk size inputs to set these parameters to 128GB and 200GB, respectively (from 64GB and 100GB, respectively). The run completed successfully despite taking 22h47m.
Here is the run for reference: https://app.terra.bio/#workspaces/aphl-personal/Institute_for_Environmental_Science_and_Research/job_history/6c4266cf-4b31-425a-8a75-3a2fdea932ca

… to 250GB; rename FASTQ files to replace underscores w dashes except for those surrounding R1/R2; rename FASTQs downloaded via SRA_Fetch and basespace_fetch wfs
@michellescribner
Copy link
Contributor

Awesome work @kapsakcj!! This version completed testing successfully: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/8c1da8d2-bcfd-4ecb-8001-82d663930c34.

I'm good with approving this since it fixes the issue but one thing that is less than optimal is that in the output files the samples are named with the index. For example, sample name = "Campylobacter-jejunii-1.1-BHQL12.2023-04-25.01-B02-S222" in the pairwise matrix. It might look nice to first rename the read files using the sample names then replace underscores with hyphens afterwards. Just a nice to have - let me know what you think!

@kapsakcj
Copy link
Contributor Author

@frankambrosio3 I believe the error you observed for this workflow is an indication of running out of disk space, not related to OOM:

Could not write 65536 bytes: Error 4
mergeVcf.sh: ERROR with bcftools merge

# [shortened for brevity]

QSUB ERROR
256
launch_set.pl: Schedule::SGELK::command: ERROR with command: Inappropriate ioctl for device

I've increased the default disk size to 250GB, but I want to keep the default memory at 64GB. It can get expensive fast if we request 128GB memory for every workflow and I don't believe it actually requires that much RAM.

I'm re-running your workflow with the increased disk_size and default memory now (still running at the moment), but as long as it succeeds, I don't think it's necessary to increase the default memory: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/7a611f0f-de65-4965-8dc5-f84327ba9a24

@kapsakcj
Copy link
Contributor Author

@michellescribner thank you!

I appreciate the feedback about output sample names, I'll have to think about this one. Not sure how we would accomplish that.

@michellescribner
Copy link
Contributor

@michellescribner thank you!

I appreciate the feedback about output sample names, I'll have to think about this one. Not sure how we would accomplish that.

We can hold off for now then in the interest of getting the patch out!

@kapsakcj kapsakcj marked this pull request as ready for review August 30, 2023 15:31
Copy link
Contributor

@michellescribner michellescribner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Curtis!!!

@michellescribner michellescribner merged commit f002dbd into main Aug 30, 2023
4 checks passed
@kapsakcj kapsakcj deleted the cjk-lyveset-params branch October 11, 2023 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants