Question: Why two clustering steps? #74

camcl · 2024-11-15T12:29:50Z

Hello,

As far as I understand, the pipeline runs UMI extraction and UMI-based clustering twice: once on the raw reads, and once again on the consensus reads derived from the clustered and polished reads.
The original pipeline (https://github.com/nanoporetech/pipeline-umi-amplicon) proceeds in the same way.

Why is the second round of UMI extraction and read clustering necessary?

I have not managed to understand the reason for repeating those steps in the documentation of the pipelines, nor in the article published by Karst et al. (2021).

Thanks a lot if you can help me clarify this.

Regards,

Camille C.

AmstlerStephan · 2024-11-15T13:46:58Z

Dear Camille,

thank you for reaching out to us. As far as I understand, this step was initially needed because of the high sequencing error rates. Some reads can accumulate so many errors in the UMI sequence that reads carrying the same true UMI end up in two or more clusters. After a first round of consensus sequence creation, these UMIs should be corrected and therefore identical. A second round of clustering then "combines" these UMI clusters and corrects the initial faulty clustering. As the sequencing error rate decreases with every iteration of the nanopore models, this second round of clustering might not be necessary with future basecalling models.

I hope this answers your question.

Kind regards, Stephan
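The splitting and re-merging described in this reply can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not the pipeline's implementation: it uses Hamming distance and a greedy first-match clustering in place of VSEARCH, and a per-position majority vote in place of polishing. The names `greedy_cluster`, `consensus`, `TRUE_UMI`, and the threshold `MAX_DIST` are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(umis, max_dist):
    """Toy greedy clustering (assumption, not VSEARCH): each UMI joins the
    first centroid within max_dist, otherwise it founds a new cluster."""
    clusters = []  # list of (centroid, members)
    for umi in umis:
        for centroid, members in clusters:
            if hamming(umi, centroid) <= max_dist:
                members.append(umi)
                break
        else:
            clusters.append((umi, [umi]))
    return clusters

def consensus(members):
    """Per-position majority vote, standing in for consensus polishing."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))

TRUE_UMI = "ACGTACGTACGT"
MAX_DIST = 2  # hypothetical clustering threshold

# Six reads of the same molecule; sequencing errors hit different positions.
raw_umis = [
    "TCGTACGTACGA",  # two errors, becomes the first centroid
    "ACGTACGTACGT",  # error-free
    "ACGTACGTACGT",  # error-free
    "ACGAACGTACGT",  # one error, but 3 mismatches to the first centroid
    "ACGTACCTACGT",  # one error
    "ACGTACGTAGGT",  # one error
]

# Round 1: the same true UMI is split across two clusters.
round1 = greedy_cluster(raw_umis, MAX_DIST)
round1_consensus = [consensus(members) for _, members in round1]

# Round 2: the corrected UMIs are identical, so the clusters merge.
round2 = greedy_cluster(round1_consensus, MAX_DIST)

print(len(round1), "clusters after round 1")
print(len(round2), "cluster(s) after round 2")
```

In this toy case the second round reduces two clusters to one, so a drop in the number of consensus sequences between rounds is the expected effect of the merge whenever the first round has split a UMI.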

camcl · 2024-11-18T12:48:00Z

Hi Stephan,

Thanks a lot for your quick reply. What you explain does make sense to me. My question was actually motivated by the loss in coverage induced by the two clustering steps, at least with the data I have worked with and the parameters that have been set for VSEARCH. I would rather cluster only once and keep more consensus sequences to carry out other downstream analyses...
After comparing the release dates of the Guppy and Dorado basecallers against the creation date of the repository and most of the commit timestamps in nanoporetech's pipeline, it seems that the pipeline was implemented before Dorado was released. As I understand it, Dorado should achieve higher basecalling accuracy than Guppy, so I hope I can run a single clustering step if I use Dorado with a super-accurate (SUP) model for basecalling.

Regards,

Camille

AmstlerStephan · 2024-12-03T16:17:50Z

Hi Camille,
please excuse the late reply!
Is your question related to the second round of UMI extraction, clustering, and consensus creation, or to the second clustering step, the cluster quality control (process reformat_filter_cluster)?

I agree that the complete second round of UMI extraction, clustering, and consensus sequence creation might not be necessary anymore, given the large improvement in sequencing quality. Accordingly, in our data I never see a difference between consensus (1 round) and final consensus sequences (2 rounds).

If your question is related to the cluster quality control step, it is explained in detail in our recent paper. In brief, we saw that vsearch clustering can produce admixed UMI clusters, that is, clusters containing UMI sequences that should be in two separate clusters. This quality control step improved consensus sequence quality and variant-level detection.
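To make the idea of an admixed cluster concrete, here is a minimal sketch of one way such a cluster could be detected and split. It is purely illustrative and is not the logic of `reformat_filter_cluster`, which is described in the paper: the function `split_admixed`, the Hamming-distance criterion, and the threshold value are all assumptions.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def split_admixed(cluster_umis, max_dist):
    """Hypothetical check: repeatedly take the most frequent remaining UMI
    as a seed and group every UMI within max_dist of it. More than one
    resulting group suggests the original cluster was admixed."""
    remaining = Counter(cluster_umis)
    groups = []
    while remaining:
        seed, _ = remaining.most_common(1)[0]
        close = [u for u in remaining if hamming(u, seed) <= max_dist]
        groups.append([u for u in close for _ in range(remaining[u])])
        for u in close:
            del remaining[u]
    return groups

# One cluster that actually mixes two distinct UMI families.
admixed = ["AAAAAAAA"] * 3 + ["AAAAAAAT"] + ["CCCCAAAA"] * 2 + ["CCCCAAAT"]
groups = split_admixed(admixed, max_dist=1)
print([len(g) for g in groups])
```

Each resulting group would then yield its own consensus sequence instead of one consensus built from mixed molecules.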

I hope this answers your question.

camcl · 2024-12-06T12:12:19Z

Hi Stephan,

I do hope I can skip the second round of UMI extraction, clustering, and consensus creation that outputs the final consensus sequences, because so far in my data it has decreased the number of consensus sequences. That is, number of consensus sequences (1 round) > number of final consensus sequences (2 rounds), which concerns me since it means that the second round reduces coverage. I want to keep the coverage high enough to carry out accurate (rare) variant calling in a later step.
However, I have not yet tried to modify the original clustering strategy as you did by introducing a cluster quality control step. That may be the reason why I observe a difference in the number of clusters between the two rounds. I came across the section mentioning this quality control in your publication, but I have not taken a closer look at the code yet.

Thanks a lot for your insights!
