
Result differences from input bias #1918

Open
JuantonioMS opened this issue Aug 25, 2022 · 3 comments

@JuantonioMS

Hi! I have an issue/question about the trim-low-abund.py script.

An example:
I have two different files -> file_1.fq.gz and file_2.fq.gz

The resulting output for file_1.fq.gz is not the same between these two cases:
(Case 1) trim-low-abund.py -C 3 -V -Z 18 -M 20G file_1.fq.gz
(Case 2) trim-low-abund.py -C 3 -V -Z 18 -M 20G file_1.fq.gz file_2.fq.gz

Case 1's file_1.fq.gz.abundtrim is not the same as Case 2's file_1.fq.gz.abundtrim. Is this behavior expected? What am I missing?

Thanks!

@ctb
Member

ctb commented Aug 25, 2022

Hello, @JuantonioMS , thanks for asking!

yes, the increased coverage from file_2.fq.gz will affect the trimming for sequences in file_1.fq.gz, as well as the false positive rate of the count-min sketches. I would expect more sequences to be trimmed in file_1.fq.gz, in particular.
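To make that concrete, here is a toy count-min sketch in plain Python (not khmer's implementation; reads, k=4, and sketch sizes are made up for illustration). Count-min estimates are upper bounds, so adding file_2's k-mers can only raise the estimated count of any k-mer it shares with file_1 (and collisions from unrelated k-mers raise counts further), which shifts which k-mers sit above or below the -C cutoff and which reads look "high coverage":

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: estimates are upper bounds, so adding more
    data can only raise (never lower) the estimate for any k-mer."""
    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{kmer}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        return min(self.tables[row][col] for row, col in self._slots(kmer))

def kmers(seq, k=4):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

file_1 = ["ACGTACGTAC", "ACGTACGTAG"]   # hypothetical reads
file_2 = ["ACGTACGTAC"] * 5             # extra coverage of the same region

# Case 1: sketch built from file_1 only
cms1 = CountMinSketch()
for read in file_1:
    for km in kmers(read):
        cms1.add(km)

# Case 2: sketch built from both files
cms2 = CountMinSketch()
for read in file_1 + file_2:
    for km in kmers(read):
        cms2.add(km)

# The same k-mer gets a strictly higher estimate once file_2 is pooled in,
# so the trimming decisions for file_1's reads can change.
print(cms1.count("ACGT"), cms2.count("ACGT"))
```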

@JuantonioMS
Author

Thank you very much for your response!

So, the next question is:
If I want to compare the kmers composition of several samples, is it better to trim one at a time than all at once?

I know that this question is not entirely appropriate in this GitHub context.
Thank you so much!

@ctb
Member

ctb commented Aug 25, 2022

it's fine to ask here!

the mental model I use for variable abundance trimming is that the only k-mers getting removed are "bad" k-mers that are errors (low-abundance k-mers in the presence of high coverage). We have various reasons to believe this is true and little counter-evidence - in some of our measures, we're seeing that no more than 1-2% of "known good" k-mers get removed - but this is still a bit of a research project!
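As a simplified illustration of that mental model (this is a toy trimmer, not khmer's actual algorithm; the reads and cutoff are invented), a read can be truncated at the first k-mer whose abundance falls below the cutoff, so the single error k-mer is removed while well-covered k-mers survive:

```python
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def trim_read(read, counts, k, cutoff):
    """Truncate a read at the first k-mer whose abundance falls
    below the cutoff (simplified abundance trimming)."""
    for i, km in enumerate(kmers(read, k)):
        if counts[km] < cutoff:
            return read[:i + k - 1]
    return read

k, cutoff = 4, 3
# 'ACGTACGT' is well covered; one read carries a sequencing error ('T' -> 'G')
reads = ["ACGTACGT"] * 5 + ["ACGGACGT"]
counts = Counter(km for r in reads for km in kmers(r, k))

trimmed = [trim_read(r, counts, k, cutoff) for r in reads]
# The five well-covered reads are untouched; the error read is cut
# at its first low-abundance k-mer.
```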

so I would suggest trimming all together if you want one fairly trustworthy number.

but really what I would suggest is actually calculating comparisons at three different trimming approaches -

  1. no trimming at all
  2. using trim-low-abund as above
  3. trimming at a fairly stringent hard cutoff (all k-mers must have abundance > 5, or something)

if all three comparisons give you similar patterns, then you have a very reliable answer :). if they differ, you might want to dig more into what could be going on.
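A minimal sketch of such a comparison (toy reads and 4-mers in plain Python rather than khmer; the Jaccard similarity and cutoff are just one way to compare compositions): error k-mers depress the raw similarity between two samples, while a stringent hard cutoff (approach 3) recovers agreement on the well-covered k-mers:

```python
from collections import Counter

def kmer_counts(reads, k=4):
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def hard_filter(counts, cutoff=5):
    """Approach 3: keep only k-mers with abundance > cutoff."""
    return {km: c for km, c in counts.items() if c > cutoff}

# Two samples of the same region, each with one low-coverage error read
sample_1 = ["ACGTACGTAC"] * 10 + ["ACGAACGTAC"]
sample_2 = ["ACGTACGTAC"] * 10 + ["ACGCACGTAC"]

c1, c2 = kmer_counts(sample_1), kmer_counts(sample_2)

sim_raw = jaccard(c1, c2)                             # approach 1: no trimming
sim_hard = jaccard(hard_filter(c1), hard_filter(c2))  # approach 3: hard cutoff

print(round(sim_raw, 2), round(sim_hard, 2))
```

If the hard-cutoff and untrimmed comparisons disagree this much on real data, that is exactly the signal to dig into what the low-abundance k-mers are.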
