extend from 3'end #538

Open
avilella opened this issue Nov 1, 2024 · 3 comments

Comments

@avilella

avilella commented Nov 1, 2024

Can medaka generate a consensus that extends from the soft-clipped 3' end of ONT reads mapped to a reference?

E.g. for B-cell repertoire or T-cell repertoire transcript sequencing with ONT, one can map the reads onto the V-gene sequence, which will look as shown below:

V-Gene ====================================================
read1  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxccccccccccccccffffffffffffff
read2  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxcccccccccccccccffffffffffffff
read3  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxixxxxxxxxxxxxxxxxxxxcccccccccccccffffffffffffff
read4  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxccccccccccccccffffffffffffff
...

The x part of each ONT read maps to the V-gene. There may be mismatches due to hypermutation, which should be dealt with in the same way as SNVs in genomic variant calling. The c part is the CDR3 region, which is unique to each cell and doesn't have a reference. The f part is the FWR4, which continues past the CDR3 region and doesn't align to the V-gene. There could also be insertions (i) and deletions (-); when they fall in the region that maps to the V-gene, they are always sequencing errors, as there are no indels in the V-gene part.
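For illustration, the soft-clipped tail described above (the c and f parts) can be recovered from a read's CIGAR string and sequence. This is a minimal sketch in plain Python, assuming the CIGAR and SEQ fields have already been read from the BAM record. The function name `three_prime_soft_clip` is hypothetical and not part of medaka, and "3' end" here means the end of SEQ as stored in the BAM record.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def three_prime_soft_clip(cigar: str, seq: str) -> str:
    """Return the bases soft-clipped at the end of SEQ (empty string if none).

    Hypothetical helper for illustration only; not a medaka function.
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Trailing hard clips are not present in SEQ, so skip over them.
    while ops and ops[-1][1] == "H":
        ops.pop()
    if ops and ops[-1][1] == "S":
        length = ops[-1][0]
        # Guard against a zero-length clip, since seq[-0:] is the whole read.
        return seq[-length:] if length else ""
    return ""


# Toy read: 52 bases aligned to the V-gene, then 28 clipped bases.
read = "A" * 52 + "C" * 14 + "G" * 14
tail = three_prime_soft_clip("52M28S", read)
```

In this toy example `tail` holds the 28 bases that do not align to the V-gene.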

Given a .bam file of reads mapped to their corresponding V-gene reference, how do I run medaka to obtain a consensus sequence that includes the CDR3 and FWR4 parts that don't map to the V-gene reference?

Thanks in advance.

@ftostevin-ont
Contributor

Any bases that are soft-clipped in the read-to-reference BAM file will be ignored when generating the features that are used for consensus inference. There is no straightforward way to remove this restriction. To extend the consensus into the FWR4 region, you would need to extend the reference sequence to include the CDR3/FWR4 regions.
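One way to sketch the reference-extension idea is to append a draft tail, taken from the soft-clipped read ends, to the V-gene sequence. Everything below is an assumption made for illustration: the function name `extend_reference` is hypothetical, the tails are assumed to come from reads of a single clone, and picking the most frequent tail of the most common length is just one simple heuristic. The reads would then need to be remapped to the extended reference before running medaka.

```python
from collections import Counter
from typing import Sequence


def extend_reference(v_gene: str, tails: Sequence[str]) -> str:
    """Append a draft CDR3/FWR4 tail to the V-gene reference.

    Hypothetical helper; the draft only needs to be close enough for the
    previously soft-clipped bases to align, not error free.
    """
    if not tails:
        return v_gene
    # Most common tail length across the reads.
    common_len, _ = Counter(len(t) for t in tails).most_common(1)[0]
    # Among tails of that length, take the most frequent sequence as the draft.
    candidates = [t for t in tails if len(t) == common_len]
    draft, _ = Counter(candidates).most_common(1)[0]
    return v_gene + draft


# Toy data: four soft-clipped tails from reads of the same (assumed) clone.
toy_tails = ["GGGTTT", "GGGTTT", "GGATTT", "GGGTT"]
extended = extend_reference("ACGT", toy_tails)
```

Because the CDR3 differs between cells, a separate extended reference would be needed for each clone under this approach.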

Alternatively, you could try using medaka smolecule, which first generates a POA of the reads and then performs a consensus of alignments to the POA consensus sequence. This should span the full length of the reads, though the accuracy will be limited by how well the variable CDR3 region can be aligned in the POA.

@avilella
Author

avilella commented Nov 5, 2024 via email

@ftostevin-ont
Contributor

This may work, but it seems simpler just to use the real reads. Any sequencing errors should be removed by the POA and medaka consensus steps, while genuine variants would be retained.
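The intuition that a consensus removes random errors while keeping genuine variants can be shown with a toy per-column majority vote. This is not POA and not medaka's model; it assumes the reads are already aligned and gap-padded to equal length, and the function name `column_majority_consensus` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def column_majority_consensus(aligned_reads: Sequence[str]) -> str:
    """Majority vote per alignment column; columns won by '-' are dropped.

    Toy illustration only: real tools build the alignment themselves and
    use far richer models than a simple vote.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("reads must be gap-padded to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":  # a majority gap means no base at this position
            consensus.append(base)
    return "".join(consensus)


# Every read carries G at the third position (a genuine variant, retained),
# while the single C substitution and the single deletion are outvoted.
toy_reads = ["ACGTTA", "ACGTTA", "ACCTTA", "AC-TTA", "ACGTTA"]
toy_consensus = column_majority_consensus(toy_reads)
```

A base shared by most reads survives the vote, whereas errors that occur in only one or two reads do not.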
