lyveset fastq file parsing bugfix and other improvements #156
…tq.gz. Other small improvements. tested fine in miniwdl, need to test in Terra
lyveset fastq parsing
Launched test on Terra of 5 Campy samples known to previously fail: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/39649f92-f3d3-4319-8fde-c3f540b43c44
@michellescribner Thanks for trying these out, seems like it's already showing an error on the step that shuffles the reads.
Dang it. This is the error for v1.0.1 of Lyve_SET for the set: Issue seems to be persisting.
This branch doesn't resolve one of the issues with this set: LyveSET appears to require a unique identifier for the reads before the first underscore in the read file names. I ought to have realized this. @kapsakcj do you think we could replace underscores with another character (maybe hyphens?) in the read file names when you are moving the files?
Hit an issue with Lyve_SET where samples appeared to succeed despite insufficient memory allocated to the virtual machine on which the job was running. The run appeared successful yet failed to produce outputs because Lyve_SET is well-written and has built-in error handling, which let the workflow complete successfully even though Lyve_SET failed to complete the analysis. As a result, the memory-related failure went undetected and Terra's memory-retry feature never engaged. We resolved the issue by using the optional memory and disk size inputs to raise these parameters to 128GB and 200GB, respectively (from 64GB and 100GB). The run then completed successfully, though it took 22h47m.
… to 250GB; rename FASTQ files to replace underscores w dashes except for those surrounding R1/R2; rename FASTQs downloaded via SRA_Fetch and basespace_fetch wfs
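The rename described in this commit (replace underscores with dashes everywhere except around the R1/R2 token) could be sketched in bash roughly as follows. `rename_fastq` is a hypothetical helper name; the actual logic lives in `task_lyveset.wdl` and may differ:

```shell
# Illustrative sketch only -- not the exact code in task_lyveset.wdl.
# Replace every underscore with a dash, then restore the underscores
# that flank the R1/R2 read-pair token so Lyve-SET can still pair reads.
rename_fastq() {
  local f="$1"
  local new="${f//_/-}"          # all underscores -> dashes
  new="${new//-R1-/_R1_}"        # restore _R1_ / _R2_ mid-name
  new="${new//-R2-/_R2_}"
  new="${new//-R1./_R1.}"        # restore _R1. / _R2. before the extension
  new="${new//-R2./_R2.}"
  echo "$new"
}

rename_fastq "Campy_sample_01_R1.fastq.gz"   # prints Campy-sample-01_R1.fastq.gz
```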
Awesome work @kapsakcj!! This version completed testing successfully: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Scribner_Sandbox/job_history/8c1da8d2-bcfd-4ecb-8001-82d663930c34. I'm good with approving this since it fixes the issue, but one thing that is less than optimal is that in the output files the samples are named with the index. For example, sample name = "Campylobacter-jejunii-1.1-BHQL12.2023-04-25.01-B02-S222" in the pairwise matrix. It might look nicer to first rename the read files using the sample names and then replace underscores with hyphens afterwards. Just a nice-to-have - let me know what you think!
@frankambrosio3 I believe the error you observed for this workflow indicates running out of disk space, not an OOM failure:

```
Could not write 65536 bytes: Error 4
mergeVcf.sh: ERROR with bcftools merge
# [shortened for brevity]
QSUB ERROR
256
launch_set.pl: Schedule::SGELK::command: ERROR with command: Inappropriate ioctl for device
```

I've increased the default disk size to 250GB, but I want to keep the default memory at 64GB. It can get expensive fast if we request 128GB of memory for every workflow, and I don't believe the tool actually requires that much RAM. I'm re-running your workflow with the increased disk_size and the default memory now (still running at the moment), but as long as it succeeds, I don't think it's necessary to increase the default memory: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/7a611f0f-de65-4965-8dc5-f84327ba9a24
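Exposing memory and disk size as optional task inputs, as discussed above, follows a common WDL pattern. This is a sketch under assumed names, not the exact contents of `task_lyveset.wdl`:

```wdl
task lyveset {
  input {
    # Defaults reflect the values discussed above; tune per dataset.
    Int memory = 64       # GB; kept at 64 to control cost
    Int disk_size = 250   # GB; raised to avoid silent out-of-disk failures
  }
  command <<<
    # analysis commands here
  >>>
  runtime {
    memory: "~{memory} GB"
    disks: "local-disk ~{disk_size} SSD"
  }
}
```

Because both values are inputs rather than hard-coded runtime attributes, a user can override them from Terra or miniwdl without editing the task.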
@michellescribner thank you! I appreciate the feedback about output sample names; I'll have to think about this one. Not sure how we would accomplish that.
We can hold off for now then in the interest of getting the patch out!
Thank you Curtis!!!
Setting as a draft for now, need to do further testing in Terra prior to marking ready for review
🛠️ Changes Being Made
`tasks/phylogenetic_inference/task_lyveset.wdl` changes:

- … from `read_cleaner` input declaration. Also switched to double quotes from single quotes
- rename FASTQ files ending in `_R{1,2}.fastq.gz` to end in `_{1,2}.fastq.gz`. See WDL for full description

🧠 Context and Rationale
The lyveSET workflow, more specifically the `shuffleSplitReads.pl` script that merges R1 and R2 FASTQ files, is unable to parse FASTQ files that end in `_R1.fastq.gz` or `_R2.fastq.gz`. So any FASTQ file brought into Terra with the basespace_fetch or sra_fetch workflows could not be used as input for lyveset.
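The workaround, renaming `_R1.fastq.gz`/`_R2.fastq.gz` suffixes to `_1.fastq.gz`/`_2.fastq.gz` before calling `shuffleSplitReads.pl`, could look like the bash helper below. `rename_pair_suffix` is a hypothetical name for illustration; see the WDL task for the actual implementation:

```shell
# Sketch of the suffix rename (illustrative, not the task's exact code).
# The /% form anchors the pattern at the end of the name, so an _R
# appearing earlier in the filename is left untouched.
rename_pair_suffix() {
  local f="$1"
  f="${f/%_R1.fastq.gz/_1.fastq.gz}"
  f="${f/%_R2.fastq.gz/_2.fastq.gz}"
  echo "$f"
}

rename_pair_suffix "SRR123_R1.fastq.gz"   # prints SRR123_1.fastq.gz
```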
Along the way I made other improvements to the task.
📋 Workflow/Task Steps
Inputs
Outputs
🧪 Testing
Locally
Tested many times locally; ran successfully with miniwdl.
Terra
Will populate this after testing is finished.
🔬 Quality checks
Pull Request (PR) checklist: