lyveset fastq file parsing bugfix and other improvements #156

kapsakcj · 2023-08-17T20:55:14Z

Setting as a draft for now, need to do further testing in Terra prior to marking ready for review

🛠️ Changes Being Made

`tasks/phylogenetic_inference/task_lyveset.wdl` changes

removed unnecessary ? from read_cleaner input declaration. Also switched to double quotes from single quotes
fixed variable for checking read1 and read2 array lengths
IMPORTANT CHANGE: added bash logic for renaming FASTQs that end in _R{1,2}.fastq.gz to end in _{1,2}.fastq.gz. See WDL for full description
altered how FASTQs are copied or moved into the required input directory for launching lyveSET.
IMPORTANT CHANGE: fixed optional flags for masking phages, masking cliffs, and sample sites.
fixed optional output Files as the paths were incorrect

🧠 Context and Rationale

The lyveSET workflow, more specifically shuffleSplitReads.pl script for merging R1 and R2 FASTQ files is unable to parse FASTQ files that end in _R1.fastq.gz or _R2.fastq.gz

So any FASTQ file brought into Terra with basespace_fetch or sra_fetch workflows would not be able to be used as input for lyveset.

Along the way I made other improvements to the task.

📋 Workflow/Task Steps

Inputs

Outputs

🧪 Testing

Locally

Tested many times locally, was able to run successfully with miniwdl

Terra

Will populate this after testing is finished.

🔬 Quality checks

Pull Request (PR) checklist:

Include a description of what is in this pull request in this message.
The workflow/task has been tested locally and on Terra
The CI/CD has been adjusted and tests are passing
Everything follows the style guide

…tq.gz. Other small improvements. tested fine in miniwdl, need to test in Terra

lyveset fastq parsing

michellescribner · 2023-08-23T17:18:21Z

Launched test on Terra of 5 Campy samples known to previously fail: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/39649f92-f3d3-4319-8fde-c3f540b43c44

kapsakcj · 2023-08-23T17:21:17Z

@michellescribner Thanks for trying these out, seems like it's already showing an error on the step that shuffles the reads

michellescribner · 2023-08-23T17:39:18Z

Dang it.

This is the error for v1.0.1 of Lyve_SET for the set:
shuffleSplitReads.pl: main::readFastqs: ERROR: trying to set Campylobacter_jejunii_1.2-BHQL12.2023-04-25.01-E01_S103_R1_001-FS10002482.fastq.gz as read number 1 for Campylobacter, but it already exists as Campylobacter_jejunii_1.1-BHQL12.2023-04-25.01-B02_S222_R1_001-FS10002485.fastq.gz
This is the error with this branch for the same set:
shuffleSplitReads.pl: main::readFastqs: ERROR: trying to set input-fastqs/Campylobacter_jejunii_1.2-BHQL12.2023-04-25.01-E01_S103_R1_001-FS10002482.fastq.gz as read number 1 for input-fastqs/Campylobacter, but it already exists as input-fastqs/Campylobacter_jejunii_1.1-BHQL12.2023-04-25.01-B02_S222_R1_001-FS10002485.fastq.gz

Issue seems to be persisting

michellescribner · 2023-08-23T17:53:51Z

This branch doesn't resolve one of the issues with this set, which is that LyveSET appears to need there to be a unique identifier for the reads before the first underscore in the read file names. I ought to have realized this. @kapsakcj do you think that we could replace underscores with another character in the read file names (maybe with hyphens?) when you are moving the files?

frankambrosio3 · 2023-08-24T14:33:24Z

Hit an issue with Lyve_SET where samples fail successfully due to insufficient memory allocation to the virtual machine on which this job was running. The reason this run appeared successful yet failed to produce outputs is the Lyve_SET tool is well-written and had built-in error handling, which caused the workflow to complete successfully despite Lyve_SET failing to complete the analysis. This prevented the detection of a memory-related workflow failure, which resulted in Terra's memory-retry feature failing to engage.

We resolved the issue by using the optional memory and disk size inputs to set these parameters to 128GB and 200GB, respectively (from 64GB and 100GB, respectively). The run completed successfully despite taking 22h47m.
Here is the run for reference: https://app.terra.bio/#workspaces/aphl-personal/Institute_for_Environmental_Science_and_Research/job_history/6c4266cf-4b31-425a-8a75-3a2fdea932ca

… to 250GB; rename FASTQ files to replace underscores w dashes except for those surrounding R1/R2; rename FASTQs downloaded via SRA_Fetch and basespace_fetch wfs

michellescribner · 2023-08-29T21:21:02Z

Awesome work @kapsakcj!! This version completed testing successfully: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/8c1da8d2-bcfd-4ecb-8001-82d663930c34.

I'm good with approving this since it fixes the issue but one thing that is less than optimal is that in the output files the samples are named with the index. For example, sample name = "Campylobacter-jejunii-1.1-BHQL12.2023-04-25.01-B02-S222" in the pairwise matrix. It might look nice to first rename the read files using the sample names then replace underscores with hyphens afterwards. Just a nice to have - let me know what you think!

kapsakcj · 2023-08-30T13:57:30Z

@frankambrosio3 I believe the error you observed for this workflow is an indication of running out of disk space, not related to OOM:

Could not write 65536 bytes: Error 4
mergeVcf.sh: ERROR with bcftools merge

# [shortened for brevity]

QSUB ERROR
256
launch_set.pl: Schedule::SGELK::command: ERROR with command: Inappropriate ioctl for device

I've increased the default disk size to 250GB, but I want to keep the default memory at 64GB. It can get expensive fast if we request 128GB memory for every workflow and I don't believe it actually requires that much RAM.

I'm re-running your workflow with the increased disk_size and default memory now (still running at the moment), but as long as it succeeds, I don't think it's necessary to increase the default memory: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/7a611f0f-de65-4965-8dc5-f84327ba9a24

kapsakcj · 2023-08-30T14:01:07Z

@michellescribner thank you!

I appreciate the feedback about output sample names, I'll have to think about this one. Not sure how we would accomplish that.

michellescribner · 2023-08-30T14:14:57Z

@michellescribner thank you!

I appreciate the feedback about output sample names, I'll have to think about this one. Not sure how we would accomplish that.

We can hold off for now then in the interest of getting the patch out!

michellescribner

Thank you Curtis!!!

kapsakcj added 5 commits August 9, 2023 15:23

correct 3 optional input params for lyveset

6dc1155

lyveSET task: added ability to handle FASTQ files ending with _R1.fas…

8116cb6

…tq.gz. Other small improvements. tested fine in miniwdl, need to test in Terra

more concise debug statement for renaming files

4518c50

corrected paths for 2 optional outputs

e5787b3

Merge pull request #154 from theiagen/cjk-lyveset-fastq-parsing

9e96d69

lyveset fastq parsing

overhaul of lyveset FASTQ file ingestion; increased default disk size…

2e9dca2

… to 250GB; rename FASTQ files to replace underscores w dashes except for those surrounding R1/R2; rename FASTQs downloaded via SRA_Fetch and basespace_fetch wfs

kapsakcj marked this pull request as ready for review August 30, 2023 15:31

michellescribner approved these changes Aug 30, 2023

View reviewed changes

michellescribner merged commit f002dbd into main Aug 30, 2023
4 checks passed

kapsakcj deleted the cjk-lyveset-params branch October 11, 2023 13:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lyveset fastq file parsing bugfix and other improvements #156

lyveset fastq file parsing bugfix and other improvements #156

kapsakcj commented Aug 17, 2023 •

edited

Loading

michellescribner commented Aug 23, 2023

kapsakcj commented Aug 23, 2023

michellescribner commented Aug 23, 2023

michellescribner commented Aug 23, 2023

frankambrosio3 commented Aug 24, 2023

michellescribner commented Aug 29, 2023

kapsakcj commented Aug 30, 2023

kapsakcj commented Aug 30, 2023

michellescribner commented Aug 30, 2023

michellescribner left a comment

lyveset fastq file parsing bugfix and other improvements #156

lyveset fastq file parsing bugfix and other improvements #156

Conversation

kapsakcj commented Aug 17, 2023 • edited Loading

🛠️ Changes Being Made

tasks/phylogenetic_inference/task_lyveset.wdl changes

🧠 Context and Rationale

📋 Workflow/Task Steps

Inputs

Outputs

🧪 Testing

Locally

Terra

🔬 Quality checks

michellescribner commented Aug 23, 2023

kapsakcj commented Aug 23, 2023

michellescribner commented Aug 23, 2023

michellescribner commented Aug 23, 2023

frankambrosio3 commented Aug 24, 2023

michellescribner commented Aug 29, 2023

kapsakcj commented Aug 30, 2023

kapsakcj commented Aug 30, 2023

michellescribner commented Aug 30, 2023

michellescribner left a comment

Choose a reason for hiding this comment

kapsakcj commented Aug 17, 2023 •

edited

Loading

`tasks/phylogenetic_inference/task_lyveset.wdl` changes