
Result differences from input bias #1918

Open
JuantonioMS opened this issue Aug 25, 2022 · 3 comments

@JuantonioMS

Hi! I have an issue/question about the trim-low-abund.py script.

An example:
I have two different files -> file_1.fq.gz and file_2.fq.gz

The resulting output for file_1.fq.gz is not the same between these two cases:
(Case 1) trim-low-abund.py -C 3 -V -Z 18 -M 20G file_1.fq.gz
(Case 2) trim-low-abund.py -C 3 -V -Z 18 -M 20G file_1.fq.gz file_2.fq.gz

Case 1's file_1.fq.gz.abundtrim is not the same as Case 2's file_1.fq.gz.abundtrim. Is this behavior expected? What am I missing?

Thanks!

@ctb
Member

ctb commented Aug 25, 2022

Hello, @JuantonioMS , thanks for asking!

yes, the increased coverage from file_2.fq.gz will affect the trimming for sequences in file_1.fq.gz, as well as the false positive rate of the count-min sketches. I would expect more sequences to be trimmed in file_1.fq.gz, in particular.
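To make that concrete, here is a toy count-min sketch in plain Python (not khmer's implementation; reads, k=4, and sketch sizes are made up for illustration). Count-min estimates are upper bounds, so adding file_2's k-mers can only raise the estimated count of any k-mer it shares with file_1 (and collisions from unrelated k-mers raise counts further), which shifts which k-mers sit above or below the -C cutoff and which reads look "high coverage":

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: estimates are upper bounds, so adding more
    data can only raise (never lower) the estimate for any k-mer."""
    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{kmer}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        return min(self.tables[row][col] for row, col in self._slots(kmer))

def kmers(seq, k=4):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

file_1 = ["ACGTACGTAC", "ACGTACGTAG"]   # hypothetical reads
file_2 = ["ACGTACGTAC"] * 5             # extra coverage of the same region

# Case 1: sketch built from file_1 only
cms1 = CountMinSketch()
for read in file_1:
    for km in kmers(read):
        cms1.add(km)

# Case 2: sketch built from both files
cms2 = CountMinSketch()
for read in file_1 + file_2:
    for km in kmers(read):
        cms2.add(km)

# The same k-mer gets a strictly higher estimate once file_2 is pooled in,
# so the trimming decisions for file_1's reads can change.
print(cms1.count("ACGT"), cms2.count("ACGT"))
```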

@JuantonioMS
Author

Thank you very much for your response!

So, the next question is:
If I want to compare the kmers composition of several samples, is it better to trim one at a time than all at once?

I know that this question is not entirely appropriate in this GitHub context.
Thank you so much!

@ctb
Member

ctb commented Aug 25, 2022

it's fine to ask here!

the mental model I use for variable abundance trimming is that the only k-mers getting removed are "bad" k-mers that are errors (low-abundance k-mers in the presence of high coverage). We have various reasons to believe this is true and little counter-evidence - in some of our measures, we're seeing that no more than 1-2% of "known good" k-mers get removed - but this is still a bit of a research project!
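As a simplified illustration of that mental model (this is a toy trimmer, not khmer's actual algorithm; the reads and cutoff are invented), a read can be truncated at the first k-mer whose abundance falls below the cutoff, so the single error k-mer is removed while well-covered k-mers survive:

```python
from collections import Counter

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def trim_read(read, counts, k, cutoff):
    """Truncate a read at the first k-mer whose abundance falls
    below the cutoff (simplified abundance trimming)."""
    for i, km in enumerate(kmers(read, k)):
        if counts[km] < cutoff:
            return read[:i + k - 1]
    return read

k, cutoff = 4, 3
# 'ACGTACGT' is well covered; one read carries a sequencing error ('T' -> 'G')
reads = ["ACGTACGT"] * 5 + ["ACGGACGT"]
counts = Counter(km for r in reads for km in kmers(r, k))

trimmed = [trim_read(r, counts, k, cutoff) for r in reads]
# The five well-covered reads are untouched; the error read is cut
# at its first low-abundance k-mer.
```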

so I would suggest trimming all together if you want one fairly trustworthy number.

but really what I would suggest is actually calculating comparisons at three different trimming approaches -

  1. no trimming at all
  2. using trim-low-abund as above
  3. trimming at a fairly stringent hard cutoff (all k-mers must have abundance > 5, or something)

if all three comparisons give you similar patterns, then you have a very reliable answer :). if they differ, you might want to dig more into what could be going on.
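A minimal sketch of such a comparison (toy reads and 4-mers in plain Python rather than khmer; the Jaccard similarity and cutoff are just one way to compare compositions): error k-mers depress the raw similarity between two samples, while a stringent hard cutoff (approach 3) recovers agreement on the well-covered k-mers:

```python
from collections import Counter

def kmer_counts(reads, k=4):
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def hard_filter(counts, cutoff=5):
    """Approach 3: keep only k-mers with abundance > cutoff."""
    return {km: c for km, c in counts.items() if c > cutoff}

# Two samples of the same region, each with one low-coverage error read
sample_1 = ["ACGTACGTAC"] * 10 + ["ACGAACGTAC"]
sample_2 = ["ACGTACGTAC"] * 10 + ["ACGCACGTAC"]

c1, c2 = kmer_counts(sample_1), kmer_counts(sample_2)

sim_raw = jaccard(c1, c2)                             # approach 1: no trimming
sim_hard = jaccard(hard_filter(c1), hard_filter(c2))  # approach 3: hard cutoff

print(round(sim_raw, 2), round(sim_hard, 2))
```

If the hard-cutoff and untrimmed comparisons disagree this much on real data, that is exactly the signal to dig into what the low-abundance k-mers are.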
