[Artic] Large overhaul for newer versions supporting clair3 #715

Draft
wants to merge 19 commits into
base: main

Michal-Babins
Contributor

@Michal-Babins Michal-Babins commented Jan 8, 2025

This PR closes #697

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

Updates Artic from version 1.3.0 to the latest release (1.6.0 at the time of writing). As of Artic 1.5.1, medaka is no longer supported and has been replaced by clair3. The artic minion medaka command is dropped in favor of artic minion, and several other arguments have changed as well. To support the new version and multiple clair3 model options, a new docker image has been made: https://github.com/theiagen/theiagen_docker_builds/blob/mb-clair3-models/artic-ncov2019/1.6.0_rerio/Dockerfile. This PR introduces a large overhaul to task_artic_consensus and only changes the docker image in task_artic_guppyplex. These changes impact wf_theiacov_ont and wf_theiacov_clearlabs.

⚡ Impacted Workflows/Tasks

Workflows:

  • wf_theiacov_ont
  • wf_theiacov_clearlabs

Tasks:

  • task_artic_consensus
  • task_artic_guppyplex

This PR may lead to different results in pre-existing outputs: Yes

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

Small changes to task_artic_guppyplex:
us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019:1.3.0-medaka-1.4.3 -> us-docker.pkg.dev/general-theiagen/theiagen/artic:1.6.0_rerio
Removed echo "DIRNAME: $(dirname)" since the variable is never set.

Large change to task_artic_consensus:
The ARTIC pipeline underwent a significant overhaul to support new versions, transitioning from Medaka's variant calling system (r941_min_high_g360 as default model) to the more modern Clair3 (r1041_e82_400bps_hac_v420 as default model). The new artic command replaces complex nested scheme directories (/Vuser) with direct reference handling through --bed and --ref flags. The pipeline also supports remote scheme fetching from primerschemes repo and expands organism coverage to include MPXV alongside SARS-CoV-2 for scenarios where no primer bed and no reference genome are provided. We are also supporting more stringent flags for extracting aligned reads (moving from -F4 to -F0x904). A new Docker image (us-docker.pkg.dev/general-theiagen/theiagen/artic:1.6.0_rerio) is introduced.
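
To make the change from -F4 to -F0x904 concrete, here is a minimal sketch that decodes the two masks. It assumes the flags are passed to samtools view -F, which excludes any read with at least one of the mask bits set. The bit meanings come from the SAM specification; the helper names `excluded_by` and `passes_filter` are hypothetical and not part of this PR.

```python
# Bits of the SAM FLAG field relevant to the two -F masks discussed above
# (bit values as defined in the SAM specification).
SAM_FLAGS = {
    0x4: "read unmapped",
    0x100: "secondary alignment",
    0x800: "supplementary alignment",
}


def excluded_by(mask: int) -> list[str]:
    """Return the names of the known flag bits that are set in an -F mask."""
    return [name for bit, name in SAM_FLAGS.items() if mask & bit]


def passes_filter(read_flag: int, mask: int) -> bool:
    """Assumed samtools view -F behaviour: keep a read only if no mask bit is set."""
    return (read_flag & mask) == 0


print(excluded_by(0x4))    # only unmapped reads are dropped
print(excluded_by(0x904))  # unmapped, secondary and supplementary are dropped

# A supplementary alignment (flag 0x800) survives -F4 but not -F0x904.
print(passes_filter(0x800, 0x4), passes_filter(0x800, 0x904))
```

In other words, the new mask keeps only primary alignments of mapped reads, which is why it is the more stringent of the two.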

Input additions and output changes to wf_theiacov_ont:
Inputs:
String? clair3_model added as optional input and passed to call artic_consensus.consensus.
Outputs:
File? read1_trimmed deprecated
medaka_vcf -> clair3_vcf
medaka_reference -> artic_reference

Input additions and output changes to wf_theiacov_clearlabs:
Inputs:
String? clair3_model added as optional input and passed to call artic_consensus.consensus.
Outputs:
File variants_from_ref_vcf = consensus.medaka_pass_vcf -> File variants_from_ref_vcf = consensus.artic_clair3_pass_vcf
String medaka_reference = consensus.medaka_reference -> String artic_reference = consensus.artic_pipeline_reference

⚙️ Algorithm

New ARTIC/Clair3 Pipeline Flow:

  1. Read Mapping
    • minimap2: ONT reads → reference (-x map-ont)
    • BAM conversion, unmapped read filtering, sorting
  2. Primer Processing
    • Dual align_trim passes:
      1. Coverage normalization (200x) + pair validation
      2. Primer trimming + amplicon depth calculation
  3. Variant Detection
    • Clair3 variant calling per read group:
      • ONT model: r1041_e82_400bps_hac_v420 as default (can be changed)
      • Settings: haploid mode, min coverage 20x, long indel detection
  4. Consensus Building
    • Variant merging and filtration
    • Coverage masking
    • bcftools consensus for final sequence generation
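
The coverage masking step in the flow above can be illustrated with a small sketch. This is not how ARTIC implements it internally; it only shows the idea that positions with too little read depth are reported as N in the consensus. The function name `mask_low_coverage` is hypothetical, and reusing the 20x minimum coverage as the masking threshold is an assumption made for illustration.

```python
def mask_low_coverage(sequence: str, depths: list[int], min_depth: int = 20) -> str:
    """Replace bases whose depth is below min_depth with 'N'.

    Illustrative only: min_depth=20 mirrors the 20x minimum coverage listed
    above, which is assumed (not confirmed) to be the masking threshold.
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(sequence, depths)
    )


# Positions with depth 5 and 0 fall below the threshold and are masked.
print(mask_low_coverage("ACGT", [25, 5, 20, 0]))  # ANGN
```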

The newest version of Artic also changes how the base command is invoked: everything is now routed through artic minion. When a bed file and reference are provided, we run:

 artic minion --model ~{clair3_model} \
        --normalise ~{normalise} \
        --threads ~{cpu} \
        --bed ~{primer_bed} \
        --ref reference.fasta \
        --read-file ~{read1} \
        ~{samplename}
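
For reviewers who want to reproduce this invocation outside of WDL, the same command can be assembled as an argument list. This is a sketch that uses only the flags shown above. The function name `build_artic_minion_cmd` and the default of 8 threads are hypothetical; the default model and the 200x normalisation value are taken from this PR description.

```python
import shlex


def build_artic_minion_cmd(
    samplename: str,
    read_file: str,
    primer_bed: str,
    reference: str = "reference.fasta",
    clair3_model: str = "r1041_e82_400bps_hac_v420",
    normalise: int = 200,
    threads: int = 8,  # hypothetical default, not taken from the task
) -> list[str]:
    """Assemble the artic minion call for the case where bed and reference are provided."""
    return [
        "artic", "minion",
        "--model", clair3_model,
        "--normalise", str(normalise),
        "--threads", str(threads),
        "--bed", primer_bed,
        "--ref", reference,
        "--read-file", read_file,
        samplename,
    ]


cmd = build_artic_minion_cmd("test", "barcode01.fastq.gz", "primers.bed")
print(shlex.join(cmd))  # shell-quoted command line, ready to paste into a terminal
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without going through a shell, which avoids quoting problems with unusual sample names.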

NOTE:
Newer versions of Artic have changed how bed files are parsed. In testing, this caused two existing primer beds to fail, the HIV primer bed and the Midnight primer bed, both of which worked once updated. We can update and host these newer beds, but for now the current locations are:
Location of updated Midnight primers: gs://fc-secure-50c9efc6-4ca8-4bf5-9752-5bd6a6da17dd/Midnight_Primers_SARS-CoV-2.scheme_updated.bed
Location of updated HIV primers: gs://fc-secure-50c9efc6-4ca8-4bf5-9752-5bd6a6da17dd/HIV-1_v2.0.primer.hyphen1200.1.bed

➡️ Inputs

In wf_theiacov_ont and wf_theiacov_clearlabs:

  • String? clair3_model has been added

In wf_theiacov_clearlabs:
medaka_docker -> artic_docker_image

In task_artic_consensus:

  • String medaka_model -> String clair3_model

⬅️ Outputs

In wf_theiacov_ont:
File? read1_trimmed deprecated
File medaka_vcf -> File clair3_vcf
String medaka_reference -> String artic_reference

In wf_theiacov_clearlabs:
File variants_from_ref_vcf = consensus.medaka_pass_vcf -> File variants_from_ref_vcf = consensus.artic_clair3_pass_vcf
String medaka_reference = consensus.medaka_reference -> String artic_reference = consensus.artic_pipeline_reference

In task_artic_consensus:

  • {samplename}.medaka.consensus.fasta -> {samplename}.consensus.fasta
  • medaka_reference -> artic_pipeline_reference
  • medaka_pass_vcf -> artic_clair3_pass_vcf
  • trim_fastq: {samplename}.primertrimmed.rg.fastq has been removed
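
Downstream users with existing wf_theiacov_ont result tables may need to adapt to the renamed columns. The sketch below shows one way to do that, using the workflow-level renames listed above. The helper name `migrate_row` is hypothetical, and dropping read1_trimmed entirely is an assumption, since the PR only marks it as deprecated.

```python
# Workflow-level output renames for wf_theiacov_ont, as listed in this PR.
RENAMED_OUTPUTS = {
    "medaka_vcf": "clair3_vcf",
    "medaka_reference": "artic_reference",
}
# Assumption: deprecated outputs are dropped rather than kept as empty columns.
REMOVED_OUTPUTS = {"read1_trimmed"}


def migrate_row(row: dict) -> dict:
    """Return a copy of a results row that uses the new output names."""
    migrated = {}
    for key, value in row.items():
        if key in REMOVED_OUTPUTS:
            continue
        migrated[RENAMED_OUTPUTS.get(key, key)] = value
    return migrated


old_row = {
    "samplename": "test",
    "medaka_vcf": "test.vcf",
    "medaka_reference": "MN908947.3",
    "read1_trimmed": "test.fastq",
}
print(migrate_row(old_row))
```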

Note: I tried to keep some of the naming more neutral to artic since clair3 technically is just doing the variant calling. I am completely open to any naming scheme we may want to adhere to.

🧪 Testing

Tests were performed against non-HIV organisms in the theiacov_ont validation data.
The HIV test was performed separately here, using updated primers that work with the new version; the primer bed file is currently just uploaded to my sandbox: gs://fc-secure-50c9efc6-4ca8-4bf5-9752-5bd6a6da17dd/HIV-1_v2.0.primer.hyphen1200.1.bed
Similarly with clearlabs, the primers had to be updated for the new version of ARTIC to work; the test can be found here, and the primer bed is here: gs://fc-secure-50c9efc6-4ca8-4bf5-9752-5bd6a6da17dd/Midnight_Primers_SARS-CoV-2.scheme_updated.bed

Here is another test case with Puerto Rico ONT data that passes when updated primer bed files are used, but fails with the current primer bed.
Puerto Rico uses the V3 Midnight Primers: gs://theiagen-public-files/terra/titan-files/SARS-CoV-2.Midnight-ONT.V3.scheme.bed, the updated ones are currently here: gs://fc-secure-50c9efc6-4ca8-4bf5-9752-5bd6a6da17dd/SARS-CoV-2.Midnight-ONT.V3.scheme_updated.bed

I also tested a scenario where sars-cov-2 is set as default with no primer bed as input, so that empty.bed gets selected, on both the new version and the current version to confirm that they both fail.

On Terra, all testing was done with a provided bed file and the reference picked up from the organism parameters. To exercise the scheme autodetection, I tested locally, directly against the task. I am happy to provide the test data if the reviewer wishes to repeat these tests.
For sars-cov-2:
miniwdl run tasks/quality_control/read_filtering/task_artic_guppyplex.wdl read1=barcode01.fastq.gz samplename=test
For mpox:
miniwdl run tasks/assembly/task_artic_consensus.wdl samplename=test read1=barcode01.fastq.gz organism=MPXV

Suggested Scenarios for Reviewer to Test

Testing on theiacov_ont and theiacov_clearlabs validation sets will be a good confirmation. Please test with any other data that is relevant to this workflow. We will need to update the primer bed files before merging this PR.

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable
    • You have updated the latest version for any affected workflows in the respective workflow documentation page and for every entry in the three workflows_overview tables.

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

…nce genome, and add scheme length for remote schemes
@Michal-Babins Michal-Babins marked this pull request as ready for review January 15, 2025 17:54
@Michal-Babins Michal-Babins requested a review from a team as a code owner January 15, 2025 17:54
@Michal-Babins Michal-Babins marked this pull request as draft January 15, 2025 17:54
Successfully merging this pull request may close these issues.

[TheiaProk_ONT] Artic 1.5.1 onwards no longer supports medaka & moved to clair3