Question: Why two clustering steps? #74
Comments
Dear Camille,
thank you for reaching out to us.
As far as I understand, this step was initially needed because of the high sequencing error rates.
It could happen that some reads accumulate so many errors in the UMI sequence that reads carrying the same UMI end up in two or more clusters. After a first round of consensus sequence creation, these UMIs should be corrected and therefore identical. A second round of clustering then "combines" these UMI clusters and corrects the initial faulty clustering.
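A toy sketch of this mechanism, in case it helps to see it run. This is not the pipeline's code: the real workflow clusters with vsearch and polishes full reads, whereas this example assumes fixed-length UMIs compared by Hamming distance, a simple greedy first-match clustering, a majority-vote consensus taken on the UMI itself, and made-up reads. All function names and the threshold of 2 mismatches are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(seqs, max_mismatches):
    """Toy greedy clustering (assumption, not vsearch): a sequence joins the
    first cluster whose centroid (its first member) is within max_mismatches,
    otherwise it starts a new cluster."""
    clusters = []
    for seq in seqs:
        for cluster in clusters:
            if hamming(seq, cluster[0]) <= max_mismatches:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

def consensus(seqs):
    """Per-position majority vote over equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def mutate(seq, changes):
    """Return seq with the bases in changes ({position: base}) substituted."""
    bases = list(seq)
    for pos, base in changes.items():
        bases[pos] = base
    return "".join(bases)

# Made-up data: every read below carries the same true UMI.
TRUE_UMI = "ACGTACGTACGT"
noisy = mutate(TRUE_UMI, {0: "T", 1: "G", 2: "C"})  # three errors in one read
err0 = mutate(TRUE_UMI, {0: "T"})
err1 = mutate(TRUE_UMI, {1: "G"})
err2 = mutate(TRUE_UMI, {2: "C"})
err11 = mutate(TRUE_UMI, {11: "A"})

reads = [noisy, err0, TRUE_UMI, err1, err2, err0, err11, err1, TRUE_UMI, err2]

# Round 1: the very noisy read seeds its own cluster, so one UMI is split in two.
round1 = greedy_cluster(reads, max_mismatches=2)

# Consensus per cluster: the errors are outvoted, both UMIs become identical.
consensus_umis = [consensus(cluster) for cluster in round1]

# Round 2: clustering the corrected UMIs merges the two clusters again.
round2 = greedy_cluster(consensus_umis, max_mismatches=2)

print(len(round1), "clusters after round 1")
print(len(round2), "cluster(s) after round 2")
```

With these reads the script reports 2 clusters after round 1 and 1 cluster after round 2: the same molecule yields two consensus sequences after the first round and a single one after the second.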
As the sequencing error rate decreases with every iteration of the nanopore basecalling models, this second round of clustering might no longer be necessary for future models.
I hope this answers your question.
Kind regards,
Stephan
Hi Stephan,
Thanks a lot for your quick reply. What you explain makes sense to me.
My question was actually motivated by the loss in coverage induced by the two clustering steps, at least with the data I have worked with and the parameters that have been set for VSEARCH.
I would rather cluster only once and keep more consensus sequences to carry out other downstream analyses...
Regards,
Camille
Hi Camille,
I agree that the complete second round of UMI extraction, clustering, and consensus sequence creation might not be necessary anymore, thanks to the big improvement in sequencing quality. Accordingly, in our data I never see a difference between consensus (1 round) and final consensus sequences (2 rounds).
If your question relates to the cluster quality control step, that step is explained in detail in our recent paper. In brief, we saw vsearch clustering produce admixed UMI clusters, that is, clusters containing UMI sequences that should be in two separate clusters. This quality control step improved consensus sequence quality and variant level detection.
I hope this answers your question.
Hi Stephan,
I do hope I can skip the second round of UMI extraction, clustering, and consensus creation that outputs the final consensus sequences, because so far in my data it has decreased the number of consensus sequences. That is, number of consensus sequences (1 round) > number of final consensus sequences (2 rounds), which concerns me because it means that the second round reduces coverage.
I want to keep the coverage high enough to carry out accurate (rare) variant calling in a later step.
Thanks a lot for your insights!
Hello,
As far as I understand, the pipeline runs the UMI extraction and UMI-based clustering steps twice: once on the raw reads, and once again on the consensus reads derived from the clustered and polished reads.
The original pipeline (https://github.com/nanoporetech/pipeline-umi-amplicon) proceeds in the same way.
Why is the second round of UMI extraction and read clustering necessary?
I could not find the reason for repeating those steps in the documentation of the pipelines, nor in the article published by Karst et al. (2021).
Thanks a lot if you can help me clarify this.
Regards,
Camille C.