Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate the use of cg_pipeline for nanoplot stats #164

Merged
merged 29 commits into from
Aug 28, 2023
Merged

Conversation

cimendes
Copy link
Member

@cimendes cimendes commented Aug 24, 2023

Partially closes # 158

🛠️ Changes Being Made

This PR removes cg_pipeline as the tool for read statistics calculation (mean quality score, among others) and instead replaces it by nanoplot stats.

The following fields were removed from the outputs in TheiaProk:

  • cg_pipeline_report_raw
  • cg_pipeline_docker
  • cg_pipeline_report_clean
  • number_raw_reads
  • number_clean_reads
  • fastq_scan_version
  • fastq_scan_report

The following fields were introduced in the datatable in TheiaProk:

  • nanoplot_version
  • nanoplot_docker
  • nanoplot_html_raw
  • nanoplot_tsv_raw
  • nanoplot_num_reads_raw1
  • nanoplot_r1_mean_readlength_raw
  • nanoplot_r1_mean_q_raw
  • nanoplot_html_clean
  • nanoplot_tsv_clean
  • nanoplot_num_reads_clean1
  • nanoplot_r1_mean_readlength_clean
  • nanoplot_r1_mean_q_clean

This PR also brings TheiaCoV_ONT and TheiaProk_ONT into alignment regarding workflow logic and tools used.

🧠 Context and Rationale

From @emmadoughty

Mean Q score: Different results output to Terra data table and in Nanoplot HTML "mean_qual" result? Nanoplot max quality seems to be above mean quality output to Terra table? Perhaps quality assessment in cg_pipeline isn't working well for Nanopore data? Some samples are also getting unrealistically high Q scores, e.g. >80! Could we output Nanoplot mean Q score to the data table?

📋 Workflow/Task Steps

Nanoplot performs statistical analysis of read length distribution and read quality. It now completely replaces cg_pipeline on both raw and clean read quality assessments. The nanoplot task was shown some love by exposing docker, memory and CPU arguments, and exposing docker as output.

Inputs

N/A

Outputs

The following fields were introduced in both TheiaCoV and TheiaProk:

  • nanoplot_version
  • nanoplot_docker
  • nanoplot_html_raw
  • nanoplot_tsv_raw
  • (nanoplot_)num_reads_raw1
  • (nanoplot_)r1_mean_readlength_raw
  • (nanoplot_)r1_mean_q_raw
  • nanoplot_html_clean
  • nanoplot_tsv_clean
  • (nanoplot_)num_reads_clean1
  • (nanoplot_)r1_mean_readlength_clean
  • (nanoplot_)r1_mean_q_clean

For the above variables in TheiaCoV, the nanoplot_ prefix was not included as to not alter existing output names for users.

In TheiaCoV only:

  • est_coverage_raw
  • est_coverage_clean

🧪 Testing

Locally

All tests passed locally

Terra

Tested on 44 ONT data samples on terra: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/f6f822af-6a5d-4efc-a8bb-e118fcadc756

TheiaProk_ONT: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Doughty_Sandbox/job_history/526ac0e1-78fa-4cd3-916a-e904ae716865

TheiaProk_ONT: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Doughty_Sandbox/job_history/8cfb3210-55d2-4a40-8966-0e895075f5b2

TheiaCoV_ONT: https://app.terra.bio/#workspaces/cdc-terra-resources/Theiagen_Wright_SC2_Sandbox/job_history/ed0f7349-5f21-4721-a106-0e0281971de0

🔬 Quality checks

Pull Request (PR) checklist:

  • Include a description of what is in this pull request in this message.
  • The workflow/task has been tested locally and on Terra
  • The CI/CD has been adjusted and tests are passing
  • Everything follows the style guide

@cimendes
Copy link
Member Author

cimendes commented Aug 24, 2023

Nanoplot to estimate coverage (similar to cg_pipeline)
cc @sage-wright

TODO:

  • remove nanoplot out of read_qc
  • pass assembly len from QUAST to nanoplot and calculate estimated coverage

@cimendes
Copy link
Member Author

Remove redundancy of fastq_scan

@sage-wright sage-wright linked an issue Aug 24, 2023 that may be closed by this pull request
5 tasks
sage-wright and others added 7 commits August 24, 2023 17:07
* make mash genome size estimation optional

* float contigs_gfa to table

* rename gfa

* add skip_mash to theiacov_ont

* update md5sum

---------

Co-authored-by: cimendes <[email protected]>
@cimendes
Copy link
Member Author

cimendes commented Aug 25, 2023

TheiaProk Testing

44 sample dataset run with main branch: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/17fd616b-cf16-453c-9280-4f7a594c8e67

44 sample dataset run with commit id 3129921: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/d4ef8d7d-fd99-4ef8-8400-31c1949593af

Deprecated columns:

  • cg_pipeline_docker
  • cg_pipeline_report_clean
  • cg_pipeline_report_raw
  • fastq_scan_version
  • nanoplot_html
  • num_reads_clean1
  • num_reads_raw1
  • pasty_comment <- @sage-wright is this intended?
  • r1_mean_q_raw
  • r1_mean_readlength_raw

New columns:

  • contigs_gfa
  • nanoplot_num_reads_clean1
  • nanoplot_num_reads_raw1
  • nanoplot_r1_mean_q_clean
  • nanoplot_r1_mean_q_raw
  • nanoplot_r1_mean_readlength_clean
  • nanoplot_r1_mean_readlength_raw
  • nanoplot_docker
  • nanoplot_html_clean
  • nanoplot_html_raw
  • nanoplot_tsv_clean
  • nanoplot_tsv_raw

@cimendes
Copy link
Member Author

TheiaCoV Testing

72 sample dataset with main branch: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/fe417b76-f79a-4004-aab4-5c792d44ffd9

72 sample dataset with commit id d367b10: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/16e494be-0547-4403-b187-2398b10e8fc4

Deprecated columns:

  • fastq_scan_version

New columns:

  • est_coverage_clean
  • est_coverage_raw
  • nanoplot_docker
  • nanoplot_html_clean
  • nanoplot_html_raw
  • nanoplot_tsv_clean
  • nanoplot_tsv_raw
  • nanoplot_version
  • r1_mean_q_clean
  • r1_mean_q_raw
  • r1_mean_readlength_clean
  • r1_mean_readlength_raw

@cimendes
Copy link
Member Author

With the exception of the pasty_comment column, all changes reflect correctly and as intended in the datatable. Awesome job @sage-wright ! For me it's an approve.

Copy link
Member

@sage-wright sage-wright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work team!

@sage-wright sage-wright merged commit 38770c2 into main Aug 28, 2023
23 checks passed
@cimendes cimendes deleted the im-theiaprok-ont branch August 29, 2023 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TheiaCoV_ONT & TheiaProk_ONT
2 participants