strange length binning? #67
Comments
HERRO does correction in windows/chunks, for which the default size is 4096 base pairs. If a read being corrected has too few overlaps in a window, the read is truncated in the region corresponding to that window, possibly resulting in multiple output reads from a single input read. Such reads have :INTEGER appended to the output read ids, e.g. a_read:0 and a_read:1 are two reads resulting from the read 'a_read'. Since windows are 4096 bp (the default window size), the truncated reads tend to have lengths around multiples of 4096, which would explain the peaks at ~4k intervals. One way to confirm that this is the cause for your dataset would be to count the reads with :INTEGER appended to their read ids in each length range. I think the proportion of such reads would be much higher in the ranges with the peaks. It is also possible that the only windows with insufficient overlaps are at the ends of a read. In that case the read is trimmed at the ends but only one read is produced, so :INTEGER is not appended. Best,
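A minimal sketch of the check suggested above, assuming the corrected reads are available as (read id, length) pairs, for example from the first two columns of a FASTA index written by `samtools faidx`. The helper names and the 200 bp tolerance are hypothetical choices for illustration and are not part of HERRO.

```python
import re

WINDOW = 4096  # default HERRO window size, as stated in the comment above
SPLIT_SUFFIX = re.compile(r":\d+$")  # ids such as a_read:0, a_read:1


def is_split_read(read_id):
    """True if the id ends in :INTEGER, i.e. the read came from a split."""
    return bool(SPLIT_SUFFIX.search(read_id))


def near_window_multiple(length, window=WINDOW, tol=200):
    """True if length is within tol bp of a non-zero multiple of window.

    tol is an arbitrary illustrative value; adjust it to the peak width
    seen in the histogram.
    """
    nearest = ((length + window // 2) // window) * window
    return nearest > 0 and abs(length - nearest) <= tol


def summarize(records, window=WINDOW, tol=200):
    """Count split reads near and away from multiples of the window size.

    records: iterable of (read_id, length).
    Returns {near_peak: [n_split, n_total]} for near_peak in (True, False).
    """
    counts = {True: [0, 0], False: [0, 0]}
    for read_id, length in records:
        key = near_window_multiple(length, window, tol)
        counts[key][1] += 1
        if is_split_read(read_id):
            counts[key][0] += 1
    return counts


def read_fai(path):
    """Yield (read_id, length) from a samtools faidx index (.fai) file."""
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1])
```

If the explanation holds, `summarize(read_fai("corrected.fasta.fai"))` (the file name is a placeholder) should show a clearly higher split fraction under the `True` key than under `False`.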
OK, thanks. Indeed, the corrected reads piling up at ~4k intervals are split reads. The impact of this behavior is quite striking, producing the most unusual read length histogram I have ever seen! 12259 fe77f842-422b-4798-a2f0-d3e4f268349e:0 However, I guess I don't fully understand why the same 4k bins are used, which produce pileups at 4.1 kb, 8.2 kb, 12.3 kb, 16.4 kb etc., rather than tiled or random 4 kb bins, which would not produce these pileups at fixed intervals?
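For reference, a small sketch of where the 4.1 kb, 8.2 kb, 12.3 kb and 16.4 kb intervals come from. It assumes windows are laid out back to back from the start of each read and that a low-overlap window is dropped whole; this is a simplification for illustration and not HERRO's actual code, and the function name is hypothetical.

```python
WINDOW = 4096  # default window size mentioned earlier in the thread


def kept_segments(read_length, dropped_windows, window=WINDOW):
    """Return (start, end) spans left after removing whole windows.

    dropped_windows: set of 0-based window indices with too few overlaps.
    Assumes windows start at 0, window, 2 * window, ... on every read.
    """
    n_windows = -(-read_length // window)  # ceiling division
    segments, start = [], None
    for i in range(n_windows):
        w_start = i * window
        if i in dropped_windows:
            if start is not None:
                segments.append((start, w_start))
                start = None
        elif start is None:
            start = w_start
    if start is not None:
        segments.append((start, read_length))
    return segments


# Lengths of the pieces from a 30,000 bp read with windows 1 and 5 dropped.
pieces = [end - start for start, end in kept_segments(30000, {1, 5})]
```

Under these assumptions every piece except the one reaching the end of the read spans a whole number of windows, so its length is a multiple of 4096 (4096, 8192, 12288, 16384, ...), which matches the observed spacing of the peaks.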
Hi @kevfengler227, Sorry for the delayed response. I suppose having the windows in the current way is the simplest way to do it, and that's why it is done this way. Do you see any potential issues with having these peaks in the read length histogram?
only if hifiasm or verkko have a problem with it
I am just realizing that my herro datasets exhibit a strange convergence of read lengths at defined size intervals. The raw reads are of various lengths, but the herro reads pile up at certain sizes. The reads are unique in the raw data and unique in the herro data, mapping to various locations in the genome. Yet many of them are the same length. Is there an explanation for this?