added 2 QC thresholds to ANI task to reduce false positives #168

Merged
sage-wright merged 28 commits into main from cjk-ani-threshold
Dec 5, 2023

Conversation

kapsakcj

@kapsakcj kapsakcj commented Aug 24, 2023

FYI: leaving in a draft state until further testing is done and feedback is received. Code changes are pretty much finished.
Update: the default ani_threshold was initially 92.0 and has since been adjusted to 80.0.

All testing is finished. This PR is ready for review 👀

🛠️ Changes Being Made

tasks/quality_control/task_mummer_ani.wdl changes

  • added new Float input ani_threshold = 80.0 as per CDC EDLB/PulseNet standard
  • added new Float input percent_bases_aligned_threshold = 70.0 as per CDC EDLB/PulseNet standard
  • exposed cpus and memory Integer optional inputs.
  • added .txt suffixes to output files to help differentiate between these and bash variables
  • IMPORTANT CHANGE: added logic that compares ANI_HIGHEST_PERCENTAGE to ani_threshold; the name of the match is output only if both thresholds are surpassed. If the ANI threshold is not surpassed, the task prints the message: ANI top species match did not surpass the user-defined threshold of ~{ani_threshold}
  • IMPORTANT CHANGE: added logic that compares ANI_HIGHEST_PERCENT_BASES_ALIGNED to percent_bases_aligned_threshold; the name of the match is output only if both thresholds are surpassed. If the percent bases aligned threshold is not surpassed, the task prints the message: ANI percent bases aligned did not surpass the user-defined threshold of ~{percent_bases_aligned_threshold}
  • added required string output ani_docker
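The two-threshold gate described in the bullets above can be sketched in bash roughly as follows. This is a minimal illustration, not the code in task_mummer_ani.wdl: the function names float_ge and ani_top_species_match, the use of awk for the floating-point comparison, the order in which the thresholds are checked, and treating "surpassed" as greater-than-or-equal are all assumptions. Only the threshold defaults (80.0 and 70.0) and the two message strings come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the two-threshold gate; not the task's actual code.

# Succeeds (exit 0) when $1 >= $2. awk handles the floating-point comparison.
# Assumption: "surpassed" is read as greater-than-or-equal.
float_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 >= b + 0) }'
}

# Usage: ani_top_species_match SPECIES ANI PERCENT_ALIGNED [ANI_THRESHOLD] [ALIGNED_THRESHOLD]
# Prints the species name only when both thresholds are met; otherwise prints
# the warning message for the first threshold that was not met.
ani_top_species_match() {
  local species="$1" ani="$2" aligned="$3"
  local ani_threshold="${4:-80.0}"      # default from this PR
  local aligned_threshold="${5:-70.0}"  # default from this PR

  if ! float_ge "$ani" "$ani_threshold"; then
    echo "ANI top species match did not surpass the user-defined threshold of ${ani_threshold}"
  elif ! float_ge "$aligned" "$aligned_threshold"; then
    echo "ANI percent bases aligned did not surpass the user-defined threshold of ${aligned_threshold}"
  else
    echo "$species"
  fi
}
```

With made-up values, `ani_top_species_match Salmonella_enterica 98.5 91.2` prints the species name, while `ani_top_species_match Salmonella_enterica 85.0 40.0` prints the percent-bases-aligned warning instead.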

TheiaProk Illumina PE, SE, ONT, and FASTA workflow changes:

  • added new string output ani_docker to all workflows
  • added ani_docker string output to call block for export_taxon_tables task

tasks/utilities/task_broad_terra_tools.wdl/export taxon tables task

  • added ani_docker string input to this task

🧠 Context and Rationale

Request from CDPH PulseNet group.

For example: we do not want to see an Enterobacter cloacae sample show the top ANI hit of Salmonella_enterica because that can be confusing to the user.

Adding this threshold will filter out the large majority, if not all, of the false-positive top matches that occur when the sample's species is not represented in the database (RGDv2, enteric pathogens) yet ANI and genetic relatedness are high enough to produce a result.

📋 Workflow/Task Steps

Inputs

new optional inputs:

  • Float ani_threshold with default value of 80.0
  • Float percent_bases_aligned_threshold with default value of 70.0
  • Int cpus with default value of 4
  • Int memory with default value of 8

Outputs

New required output:

  • String ani_docker

🧪 Testing

Locally

tested locally with miniwdl

Terra

Successfully ran TheiaProk_Illumina_PE (and export_taxon_tables behaved as expected): https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/af44a279-9918-4d0d-823a-5dccdf65aec0

Tests TODO:

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

…and memory. added logic for comparing ANI_HIGHEST_PERCENTAGE to ani_threshold and only outputting the name of the match if the threshold is surpassed. tested successfully in miniwdl.
@kapsakcj

kapsakcj commented Aug 24, 2023

todos:

  • add new ani_docker output to:
    • TheiaProk FASTA workflow
    • TheiaProk Illumina PE workflow
    • TheiaProk Illumina SE workflow
    • TheiaProk ONT workflow
    • export_taxon_table task (as well as the call block within workflows)

@kapsakcj kapsakcj added the "enhancement" label (This issue is a new feature or request) Oct 6, 2023
@kapsakcj

After adding in the 2nd threshold (percent_bases_aligned_threshold), all false positives were eliminated in my dataset of ~20 various bacterial species.

Enteric species were output as the ani_top_species_match as expected ✅

and close relatives (but different genera) had warning messages in the ani_top_species_match output String, indicating that one of the 2 thresholds was not met ✅
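Because a failed threshold puts a warning message in ani_top_species_match rather than a species name, anything reading that String downstream may want to tell the two cases apart. A minimal bash sketch, assuming the warning wording quoted in this PR's description is what ends up in the output; the helper name is_ani_threshold_warning is hypothetical and not part of the task or workflows:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the given ani_top_species_match value is
# one of the threshold warning messages rather than a species name.
# Assumption: both warnings contain the phrase matched below.
is_ani_threshold_warning() {
  case "$1" in
    *"did not surpass the user-defined threshold of"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_ani_threshold_warning "Salmonella_enterica"` returns non-zero, so a caller could treat the value as a genuine species match.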

Now to resolve these conflicts 🥲

@kapsakcj

OK, I've rebased this branch to be on top of the most recent main branch commits.

Likely will have to adjust CI hashes, but this cleans up the commit history so that all ANI related code changes in this PR occur "now" in the commit history, instead of back when I initially started this dev branch

@kapsakcj

kapsakcj commented Oct 10, 2023

Once the CI is updated and running successfully, I will relaunch tests on the various workflows to ensure functionality is retained after rebasing this branch

@kapsakcj

FYI I have merged origin/main into this branch, which totally offsets/nullifies the rebasing I did. This should make the "files changed" section much simpler and the code changes easier to review

@kapsakcj kapsakcj changed the title from "added ani_threshold Float input to animummer task. Other task improvements." to "added 2 QC thresholds to ANI task to reduce false positives" Oct 20, 2023
@kapsakcj kapsakcj marked this pull request as ready for review November 28, 2023 22:25
@sage-wright sage-wright left a comment

@sage-wright sage-wright merged commit d68a45f into main Dec 5, 2023
12 checks passed
@kapsakcj kapsakcj deleted the cjk-ani-threshold branch December 22, 2023 19:15