DBG construction from short and long reads #81

Open
SimonHegele opened this issue Sep 29, 2024 · 2 comments

@SimonHegele

Hello,

Thank you for this amazing tool. I am currently examining different methods of hybrid de novo transcriptome assembly.
I constructed various assemblies from mouse data and compared them, along with a StringTie assembly, using BUSCO and rnaQUAST. In terms of BUSCO's metrics and most of rnaQUAST's metrics, RNA-Bloom has given the best results so far.

However, the number of mismatches in the alignments of the transcripts to the reference genome is not significantly lower in the hybrid assembly than in the long-read assembly (~2.34 per kb vs ~2.4 per kb). So I wondered whether the long and short reads are treated differently in any way during DBG construction. If not, this might explain the small impact of the short reads.

Best,
simon

@kmnip kmnip self-assigned this Sep 30, 2024
@kmnip kmnip added the question Further information is requested label Sep 30, 2024
@kmnip
Collaborator

kmnip commented Sep 30, 2024

Hello Simon,
If both short reads and long reads are used, the short reads contribute only to the de Bruijn graph and to the k-mer multiplicities used in the error-correction step of long-read assembly. So RNA-Bloom is most likely correcting mismatches and short indel errors while leaving long indels untouched. I think I could implement a more aggressive strategy that uses short-read k-mers only.
Ka Ming

@SimonHegele
Author

Thank you for your reply. I think it is generally a good idea to construct the DBG from both long and short reads, so that the long reads get corrected even in regions with no short-read coverage. However, for accuracy one should trust the short reads over the long reads. Maybe this could be implemented simply by having the short reads contribute more to the k-mer multiplicity through a weighting factor.
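
To illustrate the weighting idea, here is a minimal Python sketch. It is not RNA-Bloom's actual code: the function names, the `short_weight` parameter, and its value of 3 are all hypothetical. A plain `Counter` stands in for whatever counting structure RNA-Bloom uses internally, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def weighted_kmer_counts(short_reads, long_reads, k, short_weight=3):
    """Count k-mers from both read sets.

    Each short-read occurrence adds short_weight to the multiplicity,
    each long-read occurrence adds 1 (hypothetical weighting scheme).
    """
    counts = Counter()
    for read in short_reads:
        for km in kmers(read, k):
            counts[km] += short_weight
    for read in long_reads:
        for km in kmers(read, k):
            counts[km] += 1
    return counts


# Toy data: one accurate short read, two long reads sharing the same error.
short_reads = ["ACGTAC"]
long_reads = ["ACGTTC", "ACGTTC"]

weighted = weighted_kmer_counts(short_reads, long_reads, k=4, short_weight=3)
unweighted = weighted_kmer_counts(short_reads, long_reads, k=4, short_weight=1)

# After the shared k-mer ACGT the graph branches into CGTA (short read)
# and CGTT (long reads). Compare the multiplicities of the two branches.
print("weighted:  ", weighted["CGTA"], "vs", weighted["CGTT"])
print("unweighted:", unweighted["CGTA"], "vs", unweighted["CGTT"])
```

In this toy case the short-read branch only wins at the branch point once the factor is applied. Regions covered by long reads alone still keep their k-mers in the graph, just with lower multiplicity.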
