strange length binning? #67

Open
kevfengler227 opened this issue Nov 7, 2024 · 5 comments

@kevfengler227

I am just realizing that my herro datasets exhibit a strange convergence of read lengths at defined size intervals. The raw reads are of various lengths, but the herro reads pile up at certain sizes. The reads are unique in the raw data and unique in the herro data, mapping to various locations in the genome. Yet many of them are the same length. Is there an explanation for this?

| herro length | raw length | read ID |
| --- | --- | --- |
| 53,278 | 56,718 | 01fb30f5-d464-4bb5-b276-895d3f8c34ff |
| 53,278 | 56,876 | 02a69f5f-afec-4aed-97d4-d68f1c3bf6e1 |
| 53,278 | 81,115 | 03aa28fb-90b0-458f-a5a1-0d5cfc238595:0 |
| 53,278 | 53,268 | 04d55d62-d155-4264-b052-9977970cd372 |
| 53,278 | 53,252 | 050b6fe3-92a3-40bd-b4ad-2ed28f091c3e |
| 53,278 | 63,875 | 07ac7b7b-8814-4cf2-8638-04fe21390ff5:0 |
| 53,278 | 58,100 | 08f717ef-863d-4b63-9f1b-cf195f374d2a |
| 53,278 | 114,672 | 097c9372-418d-456e-ae9a-a9082ca72e3d |
| 53,278 | 53,285 | 0b979a87-043f-47b6-ab9e-22380c04db32 |
| 53,278 | 58,583 | 0ea5941b-fb88-401d-bc26-3c82e6b8778e |
| 53,278 | 107,401 | 0fb884e2-9cde-4948-8634-0e05b8c53ee5 |

@kevfengler227
Author

image

@dehui333
Collaborator

dehui333 commented Nov 11, 2024

Hi @kevfengler227

HERRO does correction in windows/chunks, for which the default size is 4096 base pairs.

If a read being corrected has a low number of overlaps in a window, the read would be truncated in the region corresponding to the window, possibly resulting in multiple output reads from a single input read. Such reads have :INTEGER appended to the output read ids, e.g. a_read:0, a_read:1 are two reads resulting from the read 'a_read'.

Since windows are of 4096bp (default window size), the truncated reads tend to have lengths that are around multiples of 4096, which would explain the peaks at ~4k intervals. I suppose one way to confirm that this is the cause for your dataset would be to count the number of reads with :INTEGER appended to their read ids for reads that fall into the different length ranges - I think there would be a much higher proportion of reads with :INTEGER appended for those in the ranges with the peaks.
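The check suggested above could be sketched roughly as follows. This is only an illustration: `count_split_reads`, the `TOLERANCE` of 64 bp, and the two bucket names are made up for this example, and it assumes you have already extracted `(read_id, length)` pairs from the corrected reads by whatever means you prefer. The example IDs and lengths are copied from this thread.

```python
import re

WINDOW_SIZE = 4096   # default window size mentioned above
TOLERANCE = 64       # hypothetical: how close to a multiple counts as "in a peak"

# A split read carries ":INTEGER" at the end of its id, e.g. "a_read:0".
SPLIT_SUFFIX = re.compile(r":\d+$")


def count_split_reads(reads, window_size=WINDOW_SIZE, tolerance=TOLERANCE):
    """Count split reads among reads near / not near a multiple of window_size.

    `reads` is an iterable of (read_id, length) pairs.
    Returns {"near_multiple": (n_split, n_total), "elsewhere": (n_split, n_total)}.
    """
    counts = {"near_multiple": [0, 0], "elsewhere": [0, 0]}
    for read_id, length in reads:
        # Nearest whole number of windows, using integer arithmetic only.
        k = (length + window_size // 2) // window_size
        near = k >= 1 and abs(length - k * window_size) <= tolerance
        bucket = counts["near_multiple" if near else "elsewhere"]
        bucket[0] += 1 if SPLIT_SUFFIX.search(read_id) else 0
        bucket[1] += 1
    return {name: (split, total) for name, (split, total) in counts.items()}


# A handful of corrected reads quoted in this thread.
example_reads = [
    ("03aa28fb-90b0-458f-a5a1-0d5cfc238595:0", 53278),
    ("04d55d62-d155-4264-b052-9977970cd372", 53278),
    ("fe77f842-422b-4798-a2f0-d3e4f268349e:0", 12259),
    ("fe77f842-422b-4798-a2f0-d3e4f268349e:1", 20481),
    ("fe77f842-422b-4798-a2f0-d3e4f268349e:2", 29976),
]

for name, (n_split, n_total) in count_split_reads(example_reads).items():
    print(f"{name}: {n_split} split of {n_total} reads")
```

Five reads are far too few to say anything; on a full dataset, the comparison of interest is the split fraction in `near_multiple` versus `elsewhere`. Note that reads trimmed only at the ends would land near a multiple without carrying a suffix, so the peak bucket is not expected to consist purely of split reads.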

It is also possible that the only windows with insufficient overlaps are at the sides. In this case, the read would be trimmed at the sides, but only one read is produced, so :INTEGER would not be appended.
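To make the behaviour described above concrete, here is a toy model of it. It is not HERRO's actual code: the function `split_read`, the `MIN_OVERLAPS` threshold of 2, and the per-window overlap counts are all assumptions chosen for illustration, and coordinates are in raw-read bases (corrected lengths would differ slightly).

```python
WINDOW_SIZE = 4096  # default window size mentioned above
MIN_OVERLAPS = 2    # hypothetical threshold, NOT HERRO's actual value


def split_read(read_id, read_len, overlaps_per_window,
               window_size=WINDOW_SIZE, min_overlaps=MIN_OVERLAPS):
    """Toy model: drop windows with too few overlaps, keep the runs between them.

    Windows start at position 0 of the read; the last one may be shorter.
    Returns a list of (output_id, start, end) with end exclusive.
    """
    n_windows = -(-read_len // window_size)  # ceiling division
    if len(overlaps_per_window) != n_windows:
        raise ValueError("need one overlap count per window")

    segments = []
    start = None  # start of the current run of well-supported windows
    for w, n_overlaps in enumerate(overlaps_per_window):
        supported = n_overlaps >= min_overlaps
        if supported and start is None:
            start = w * window_size
        elif not supported and start is not None:
            segments.append((start, w * window_size))
            start = None
    if start is not None:
        segments.append((start, read_len))  # run extends to the end of the read

    if len(segments) == 1:
        # Only trimmed at the sides: a single output read, no suffix.
        s, e = segments[0]
        return [(read_id, s, e)]
    return [(f"{read_id}:{i}", s, e) for i, (s, e) in enumerate(segments)]


# A 20,000 bp read has 5 windows (4 full ones plus a 3,616 bp remainder).
print(split_read("a_read", 20000, [5, 5, 0, 5, 5]))  # poorly supported window in the middle
print(split_read("a_read", 20000, [5, 5, 5, 5, 0]))  # poorly supported window at the end only
```

In this model, a cut in the middle yields `a_read:0` spanning 0 to 8192 and `a_read:1` spanning 12288 to 20000, while a cut at the end yields a single unsuffixed read spanning 0 to 16384. Any piece that does not include the end of the read has a length that is a whole number of windows.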

Best,
Dehui

@kevfengler227
Author

OK, thanks. Indeed, the corrected reads piling up at ~4k intervals are split reads. The impact of this behavior is quite striking, producing the most unusual read length histogram I have ever seen!

12259 fe77f842-422b-4798-a2f0-d3e4f268349e:0
20481 fe77f842-422b-4798-a2f0-d3e4f268349e:1
29976 fe77f842-422b-4798-a2f0-d3e4f268349e:2

However, I guess I don't fully understand why the same 4k bins are used, which produces pileups at 4.1 kb, 8.2 kb, 12.3 kb, 16.4 kb, etc., rather than tiled or random 4 kb bins, which would not produce these pileups at fixed intervals?
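As a quick arithmetic check, the lengths quoted in this thread can be compared against whole multiples of the 4096 bp window. The helper `nearest_window_multiple` below is just an illustrative name, not anything from HERRO.

```python
WINDOW_SIZE = 4096  # default window size given earlier in this thread


def nearest_window_multiple(length, window_size=WINDOW_SIZE):
    """Return (k, offset) with length == k * window_size + offset,
    where k is the nearest whole number of windows."""
    k = (length + window_size // 2) // window_size
    return k, length - k * window_size


# Corrected read lengths quoted in this thread.
observed = [53278, 12259, 20481, 29976]

for length in observed:
    k, offset = nearest_window_multiple(length)
    print(f"{length}: {k} x {WINDOW_SIZE} = {k * WINDOW_SIZE}, offset {offset:+d}")
```

The first three lengths sit within about 30 bp of 13, 3, and 5 windows respectively, whereas 29,976 (the `:2` piece) is more than 1 kb away from 7 windows. One possible reading, which is an inference rather than something confirmed in this thread, is that windows are laid out from the start of each read, so only the piece containing the read's end can have an arbitrary length.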

@dehui333
Collaborator

Hi @kevfengler227,

Sorry for the delayed response.

I suppose having the windows in the current way is the simplest way to do it, and that's why it is done this way.

Do you see any potential issues with having these peaks in the read length histogram?

@kevfengler227
Author

only if hifiasm or verkko have a problem with it
