Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential bug in compute_fraglen.py #99

Open
mrmilhaven opened this issue Oct 20, 2021 · 2 comments
Open

Potential bug in compute_fraglen.py #99

mrmilhaven opened this issue Oct 20, 2021 · 2 comments

Comments

@mrmilhaven
Copy link

mrmilhaven commented Oct 20, 2021

Describe the bug
A clear and concise description of what the bug is.
Fragment Length Distribution is incorrect, even when using NEAT's gen_reads.py

To Reproduce
Steps to reproduce the behavior:

  1. Create NEAT dataset with fragment length distribution as mean = 300 and SD = 10 (test.fa is 1Mb of random nucleotides, no Ns):
    python ~/NEAT/gen_reads.py -R 100 -r test.fa -c 10 --pe 300 10 -o NEAT_test

2.Map reads (I used BWA):
bwa mem -M test.fa NEAT_test_read1.fq.gz NEAT_test_read2.fq.gz > NEAT_test.bam

  1. Compute fragment distribution as sanity check:
    python ~/NEAT/utilities/compute_fraglen.py -i NEAT_test.bam -o NEAT_test

  2. open pickle file:
    import pandas
    pandas.read_pickle("NEAT_test.p")
    [[100], [1.0]]

Expected behavior
I would expect the same distribution that I gave gen_reads.py (mean of 300 with SD of 10) to come back out.

Potential Reason
I think the bug has to do with the count_frags method:
First 9 fields in BAM from BAM file:
NEAT_test-test-1 99 test 9039 60 100M = 9254 315

First 9 fields of BAM from script(adding print line within count_frags for loop, approx. line 96):
NEAT_test-test-1 99 0 9038 60 100M 0 9253 100

I believe that converting from an iterable pysam type to a string (line 95) causes some of the fields to change. This also implies that , since mate mapping(field 8) is always equal to the reference (field 3), every read will look like it was mate mapped, per your code (line 108).

I think this can be fixed by using pysams inherent parameters, such as item.mate_is_unmapped or item.template_length instead of splitting the line by tabs.

@joshfactorial
Copy link
Collaborator

Can you duplicate this bug on our main development repo: https://github.com/ncsa/NEAT/issues? That will help us keep things together. I'm currently investigating those utilities, so this is timely.

@mrmilhaven
Copy link
Author

Duplicated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants