Question: Why two clustering steps? #74

Open
camcl opened this issue Nov 15, 2024 · 4 comments

camcl commented Nov 15, 2024

Hello,

As far as I understand, the pipeline runs UMI extraction and UMI-based clustering of the sequences twice: once on the raw reads, and once again on the consensus reads derived from the clustered and polished reads.
The original pipeline (https://github.com/nanoporetech/pipeline-umi-amplicon) proceeds in the same way.

Why is the second round of UMI extraction and read clustering necessary?

I have not been able to find the reason for repeating those steps in the documentation of either pipeline, nor in the article published by Karst et al. (2021).

Thanks a lot if you can help me clarify this.

Regards,

Camille C.

Member

AmstlerStephan commented Nov 15, 2024 via email

Author

camcl commented Nov 18, 2024

Hi Stephan,

Thanks a lot for your quick reply. What you explain does make sense to me. My question was actually motivated by the loss in coverage induced by the two clustering steps, at least with the data I have worked with and the parameters that have been set for VSEARCH. I would rather cluster only once and keep more consensus sequences to carry out other downstream analyses...
After comparing the release dates of the Guppy and Dorado basecallers against the creation date of the repository and most of the commit timestamps in nanoporetech's pipeline, it seems that the pipeline was implemented before Dorado was released. As I understand it, Dorado should achieve higher basecalling accuracy than Guppy, so I hope I can run a single clustering step if I use Dorado with a super-accurate (SUP) model for basecalling.

Regards,

Camille

@AmstlerStephan
Member

Hi Camille,
please excuse the late reply!
Is your question related to the second round of UMI extraction, clustering, and consensus creation, or to the second clustering step, the cluster quality control (process reformat_filter_cluster)?

I agree that the complete second round of UMI extraction, clustering, and consensus sequence creation might not be necessary anymore, due to the big improvement in sequencing quality. Accordingly, in our data, I never see a difference between consensus (1 round) and final consensus sequences (2 rounds).

If your question is related to the cluster quality control step, it is explained in detail in our recent paper. In brief, we saw that vsearch clustering can produce admixed UMI clusters, that is, clusters containing UMI sequences that should be in two separate clusters. This quality control step improved consensus sequence quality and variant level detection.
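
To make the idea of an admixed cluster concrete, here is a minimal sketch in Python. It is not the code of the `reformat_filter_cluster` process; the function names, the use of Hamming distance, and the `max_dist` threshold are all assumptions chosen for illustration. It takes the most frequent UMI in a cluster as a reference and flags UMIs that are too far from it.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count mismatching positions; any length difference counts as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def split_admixed(umis: list[str], max_dist: int = 3) -> tuple[list[str], list[str]]:
    """Split a cluster into UMIs close to the most common UMI and outliers.

    max_dist is a hypothetical threshold, not a pipeline parameter.
    """
    if not umis:
        return [], []
    # Use the most frequent UMI as the reference for this cluster.
    center = Counter(umis).most_common(1)[0][0]
    kept = [u for u in umis if hamming(u, center) <= max_dist]
    outliers = [u for u in umis if hamming(u, center) > max_dist]
    return kept, outliers


# Toy cluster: three near-identical UMIs plus one that belongs elsewhere.
cluster = ["AAAACCCCGGGG", "AAAACCCCGGGG", "AAAACCCCGGGT", "TTTTGGGGAAAA"]
kept, outliers = split_admixed(cluster)
print(len(kept), "kept;", len(outliers), "flagged as possibly admixed")
```

A real quality control step would likely use an edit distance that tolerates indels, since nanopore reads contain insertions and deletions.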

I hope this answers your question.

Author

camcl commented Dec 6, 2024

Hi Stephan,

I indeed hope I can skip the second round of UMI extraction, clustering, and consensus creation that outputs the final consensus sequences, because so far in my data it has decreased the number of consensus sequences. That is, number of consensus sequences (1 round) > number of final consensus sequences (2 rounds), which is a concern to me since it means that the second round reduces coverage. I want to keep the coverage high enough to carry out accurate (rare) variant calling in a later step.
However, I have not yet tried to modify the original clustering strategy as you did by introducing a cluster quality control step. Maybe that is the reason why I observe a difference in the number of clusters between the two rounds. I came across the section mentioning this quality control in your publication, but I have not taken a closer look at the code yet.
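
The drop between the two rounds can be quantified with a small script. The following is only a sketch: it assumes the consensus sequences of each round are available as FASTA, and the record names and toy sequences are made up for illustration.

```python
import io
from typing import TextIO


def count_fasta_records(handle: TextIO) -> int:
    """Count sequences in FASTA text by counting header lines."""
    return sum(1 for line in handle if line.startswith(">"))


def relative_loss(n_round1: int, n_round2: int) -> float:
    """Fraction of round-1 consensus sequences missing after round 2."""
    if n_round1 == 0:
        return 0.0
    return (n_round1 - n_round2) / n_round1


# Toy in-memory data; with real data, pass an opened FASTA file instead.
round1 = io.StringIO(">c1\nACGT\n>c2\nACGA\n>c3\nACGC\n>c4\nACGG\n")
round2 = io.StringIO(">f1\nACGT\n>f2\nACGA\n>f3\nACGC\n")

n1 = count_fasta_records(round1)
n2 = count_fasta_records(round2)
print(f"round 1: {n1}, round 2: {n2}, loss: {relative_loss(n1, n2):.0%}")
```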

Thanks a lot for your insights!
