strange length binning? #67

kevfengler227 · 2024-11-07T19:44:25Z

I am just realizing that my herro datasets exhibit a strange convergence of read lengths at defined size intervals. The raw reads are of various lengths, but the herro reads pile up at certain sizes. The reads are unique in the raw data and unique in the herro data, mapping to various locations in the genome. Yet many of them are the same length. Is there an explanation for this?

herro	raw	ID
53,278	56,718	01fb30f5-d464-4bb5-b276-895d3f8c34ff
53,278	56,876	02a69f5f-afec-4aed-97d4-d68f1c3bf6e1
53,278	81,115	03aa28fb-90b0-458f-a5a1-0d5cfc238595:0
53,278	53,268	04d55d62-d155-4264-b052-9977970cd372
53,278	53,252	050b6fe3-92a3-40bd-b4ad-2ed28f091c3e
53,278	63,875	07ac7b7b-8814-4cf2-8638-04fe21390ff5:0
53,278	58,100	08f717ef-863d-4b63-9f1b-cf195f374d2a
53,278	114,672	097c9372-418d-456e-ae9a-a9082ca72e3d
53,278	53,285	0b979a87-043f-47b6-ab9e-22380c04db32
53,278	58,583	0ea5941b-fb88-401d-bc26-3c82e6b8778e
53,278	107,401	0fb884e2-9cde-4948-8634-0e05b8c53ee5

kevfengler227 · 2024-11-07T19:44:34Z

dehui333 · 2024-11-11T06:21:13Z

Hi @kevfengler227

HERRO does correction in windows/chunks, for which the default size is 4096 base pairs.

If a read being corrected has a low number of overlaps in a window, the read would be truncated in the region corresponding to the window, possibly resulting in multiple output reads from a single input read. Such reads have :INTEGER appended to the output read ids, e.g. a_read:0, a_read:1 are two reads resulting from the read 'a_read'.

Since windows are of 4096bp (default window size), the truncated reads tend to have lengths that are around multiples of 4096, which would explain the peaks at ~4k intervals. I suppose one way to confirm that this is the cause for your dataset would be to count the number of reads with :INTEGER appended to their read ids for reads that fall into the different length ranges - I think there would be a much higher proportion of reads with :INTEGER appended for those in the ranges with the peaks.
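The check suggested above could be sketched roughly as follows. This is only an illustration, not part of HERRO: it assumes the corrected reads have already been reduced to `(read_id, length)` pairs, that split reads can be recognised by a trailing `:INTEGER` in the id, and that "near a peak" means within a hypothetical tolerance `tol` of a multiple of the window size.

```python
import re

WINDOW = 4096  # default window size mentioned in this thread
SPLIT_SUFFIX = re.compile(r":\d+$")  # assumed form of the ":INTEGER" suffix


def split_fraction_near_peaks(reads, window=WINDOW, tol=64):
    """Compare the fraction of split reads near vs. away from window multiples.

    reads: iterable of (read_id, length) pairs.
    tol:   hypothetical tolerance (bp) around each multiple of `window`.
    Returns (fraction_split_near, fraction_split_far).
    """
    near = [0, 0]  # [split reads, all reads] close to a multiple of window
    far = [0, 0]   # [split reads, all reads] everywhere else
    for read_id, length in reads:
        remainder = length % window
        distance = min(remainder, window - remainder)
        bucket = near if distance <= tol else far
        bucket[1] += 1
        if SPLIT_SUFFIX.search(read_id):
            bucket[0] += 1

    def fraction(bucket):
        return bucket[0] / bucket[1] if bucket[1] else 0.0

    return fraction(near), fraction(far)
```

If the explanation holds, the first returned fraction should be clearly higher than the second. Reads that were only trimmed at the ends carry no suffix, so they would not be counted as split here.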

It is also possible that the only windows with insufficient overlaps are at the ends of the read. In this case, the read would be trimmed at the ends but only one read is produced, so :INTEGER would not be appended.
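To make the two cases concrete, here is a toy model of the behaviour described in this comment. It is not HERRO's actual code: the function name, the per-window overlap counts, and the `min_overlaps` threshold are all made up for illustration, and it works in raw-read coordinates only.

```python
def split_by_windows(read_len, overlaps_per_window, min_overlaps, window=4096):
    """Toy model: keep runs of consecutive windows that have enough overlaps.

    overlaps_per_window: one (hypothetical) overlap count per window.
    min_overlaps:        hypothetical threshold; the real one is not stated here.
    Returns a list of (start, end) intervals in raw-read coordinates.
    """
    n_windows = -(-read_len // window)  # ceiling division
    if len(overlaps_per_window) != n_windows:
        raise ValueError("expected one overlap count per window")

    segments = []
    start = None
    for i, n_overlaps in enumerate(overlaps_per_window):
        if n_overlaps >= min_overlaps:
            if start is None:
                start = i * window          # a kept run begins here
        elif start is not None:
            segments.append((start, i * window))  # run ends at a window edge
            start = None
    if start is not None:
        segments.append((start, read_len))  # run reaches the end of the read
    return segments


# A poorly covered window in the middle splits the read in two.
print(split_by_windows(20000, [5, 5, 0, 5, 5], min_overlaps=2))
# Poorly covered windows only at the ends give a single trimmed read.
print(split_by_windows(20000, [0, 5, 5, 5, 0], min_overlaps=2))
```

In this model every cut falls on a window boundary, so any segment that does not include the final partial window has a length that is a multiple of 4096. Corrected lengths would presumably differ a little from these raw-coordinate lengths, consistent with the peaks being around rather than exactly at the multiples.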

Best,
Dehui

kevfengler227 · 2024-11-11T15:05:05Z

OK, thanks. Indeed, the corrected reads piling up at ~4k intervals are split reads. The impact of this behavior is quite striking, producing the most unusual read length histogram I have ever seen!

12259 fe77f842-422b-4798-a2f0-d3e4f268349e:0
20481 fe77f842-422b-4798-a2f0-d3e4f268349e:1
29976 fe77f842-422b-4798-a2f0-d3e4f268349e:2

However, I guess I don't fully understand why the same 4k bins are used for every read. They produce pileups at 4.1 kb, 8.2 kb, 12.3 kb, 16.4 kb, etc. Wouldn't tiled or randomly offset 4 kb bins avoid these pileups at fixed intervals?
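For reference, the pileup positions can be compared against integer multiples of the 4096 bp default window with a few lines of arithmetic. This is just a sketch, with a helper name (`nearest_window_multiple`) made up for illustration, applied to lengths quoted in this thread.

```python
WINDOW = 4096  # default window size mentioned in this thread


def nearest_window_multiple(length, window=WINDOW):
    """Return (k, offset) where k * window is the multiple nearest to length."""
    k = (length + window // 2) // window  # round to the nearest multiple
    return k, length - k * window


for length in (12259, 20481, 29976, 53278):
    k, offset = nearest_window_multiple(length)
    print(f"{length}: nearest multiple is {k} x {WINDOW} = {k * WINDOW} "
          f"(offset {offset:+d} bp)")
```

The first four multiples are 4096, 8192, 12288 and 16384, which match the 4.1, 8.2, 12.3 and 16.4 kb pileups. Not every split read lands near a multiple, though: 29976 is 1304 bp above 7 x 4096.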

dehui333 · 2024-11-18T08:44:11Z

Hi @kevfengler227,

Sorry for the delayed response.

I suppose having the windows in the current way is the simplest way to do it, and that's why it is done this way.

Do you see any potential issues with having these peaks in the read length histogram?

kevfengler227 · 2024-11-19T02:45:16Z

only if hifiasm or verkko have a problem with it

dominikstanojevic assigned dehui333 Nov 9, 2024
