-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DI-809 refactor #9
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 20 of 20 files at r1.
Reviewable status: 18 of 20 files reviewed, 2 unresolved discussions (waiting on @jethror1 and @mattgarner)
src/vcf_qc.py
line 7 at r1 (raw file):
import sys if os.path.exists('/home/dnanexus'):
is that a pytest thing? i.e. you need to install packages in dnanexus mode but not in pytest mode because you have a step for that?
also it's disgusting having to install packages like that in dnanexus
src/vcf_qc.py
line 73 at r1 (raw file):
""" autosomes = [ x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y
brain hurt a bit trying to understand that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 18 of 20 files reviewed, 2 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)
src/vcf_qc.py
line 7 at r1 (raw file):
Previously, Yu-jinKim wrote…
is that a pytest thing? i.e. you need to install packages in dnanexus mode but not in pytest mode because you have a step for that?
also it's disgusting having to install packages like that in dnanexus
its really just a DNAnexus thing, the other option is to have it in execDepends
in the dxapp.json, but that means it pulls from pypi. When I asked support I got a worse answer so went with this which works fine. Also allows to keep running it locally too and not install packages
src/vcf_qc.py
line 73 at r1 (raw file):
Previously, Yu-jinKim wrote…
brain hurt a bit trying to understand that
added a comment, its just unpacking to a single list to have both with and without the chr prefix, else it ends up a list of tuples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 20 files at r1, 1 of 1 files at r5.
Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)
src/vcf_qc.py
line 73 at r1 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
added a comment, its just unpacking to a single list to have both with and without the chr prefix, else it ends up a list of tuples
maybe not using variable names x and y in context of autosomes (when not intending to refer to chrX/chrY) is a good idea for less brain hurt
src/vcf_qc.py
line 74 at r5 (raw file):
# build single list both with and without prefix autosomes = [ x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y
Looks like we call this function for each variant assessed, meaning we generate an identical list over and over and over. Is there a better way whereby this list is generated once for all checks?
Code quote:
x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y
src/vcf_qc.py
line 115 at r5 (raw file):
} variants = autosomes = x_variants = 0
Since these are also counts, would it make sense for them to be in the counts dict?
Code quote:
variants = autosomes = x_variants = 0
src/vcf_qc.py
line 121 at r5 (raw file):
if not all(x in sample_fields for x in ['AD', 'DP', 'GT']): # TODO - do we still want to do this? does this even happen?
a vcf without these is possible. I guess the point was to check the fields we need are present and fail slightly earlier if not
Let's keep it, probably slightly easier for someone to interpret this than the error message that would come from a missing field later
Maybe the message itself could be a little more specific about which fields may be missing
Code quote:
# TODO - do we still want to do this? does this even happen?
src/vcf_qc.py
line 146 at r5 (raw file):
x_variants += 1 counts['x_hom'].append(non_ref_aaf) else:
Could we handle GT=0/1 (ref/alt1) vs GT=1/2 (alt1/alt2) differently?
This het ratio produced by this tool can give an indication of capture efficiency differences between ref vs alt alleles, but if we include GT=1/2 in the mix then we're sometimes comparing alt vs alt and so diluting that signal a bit
Code quote:
e
src/vcf_qc.py
line 196 at r5 (raw file):
# we don't have both het and hom variants => TODO figure out what to do # assume this is an empty vcf, and we'd just want to output something still? return ratios
If this happens something is drastically wrong and other qc detects it, so not too concerned about this. Setting to None makes sense
Code quote:
# we don't have both het and hom variants => TODO figure out what to do
# assume this is an empty vcf, and we'd just want to output something still?
return ratios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 20 files at r1.
Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)
src/vcf_qc.py
line 73 at r1 (raw file):
Previously, mattgarner wrote…
maybe not using variable names x and y in context of autosomes (when not intending to refer to chrX/chrY) is a good idea for less brain hurt
ha yes, didn't think about that, have switched to i, j
src/vcf_qc.py
line 74 at r5 (raw file):
Previously, mattgarner wrote…
Looks like we call this function for each variant assessed, meaning we generate an identical list over and over and over. Is there a better way whereby this list is generated once for all checks?
we could remove the list generating into the calling function so it's only once, then pass it in but that makes testing worse. Alternatively add a generate_autosomes()
function to call, then have this take in the output from that which would be fine I guess.
Since this is a defined list of 24 being generated, then iterating over to split the tuples I'm not overly concerned with performance vs readability of keeping the context in single function, but I appreciate the thought of not wanting needless inefficiency (especially if this was doing something more intense / API querying etc) 🙂
demonstrating the time for that list comp:
$ python3 -m timeit -u sec '[i for j in [(str(i), f"chr{i}") for i in range(1, 23)] for i in j]'
50000 loops, best of 5: 6.3e-06 sec per loop
src/vcf_qc.py
line 115 at r5 (raw file):
Previously, mattgarner wrote…
Since these are also counts, would it make sense for them to be in the counts dict?
I could, I just wanted them there for a simple count at the end of the function, I don't use them anywhere else
src/vcf_qc.py
line 121 at r5 (raw file):
Previously, mattgarner wrote…
a vcf without these is possible. I guess the point was to check the fields we need are present and fail slightly earlier if not
Let's keep it, probably slightly easier for someone to interpret this than the error message that would come from a missing field later
Maybe the message itself could be a little more specific about which fields may be missing
so print a warning and continue? Shall add that since it makes sense
src/vcf_qc.py
line 146 at r5 (raw file):
Previously, mattgarner wrote…
Could we handle GT=0/1 (ref/alt1) vs GT=1/2 (alt1/alt2) differently?
This het ratio produced by this tool can give an indication of capture efficiency differences between ref vs alt alleles, but if we include GT=1/2 in the mix then we're sometimes comparing alt vs alt and so diluting that signal a bit
sure, what would you like doing? do you want them just excluding from the mean_het and het:hom ratio and adding to their own field?
src/vcf_qc.py
line 196 at r5 (raw file):
Previously, mattgarner wrote…
If this happens something is drastically wrong and other qc detects it, so not too concerned about this. Setting to None makes sense
I was more thinking if an empty vcf goes in (i.e. a sample has virtually no reads), that it still outputs something so the job doesn't fail and cause downstream jobs to fail. I would hope we would pick this up in multiQC 🙂
will ditch the TODO and make sure an empty vcf works, probably should write some tests for main for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: 19 of 20 files reviewed, 6 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)
src/vcf_qc.py
line 115 at r5 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
I could, I just wanted them there for a simple count at the end of the function, I don't use them anywhere else
Then instead could the existing vars be renamed variants_count, autosomes_count etc, since they are counts of variants rather than variants
src/vcf_qc.py
line 146 at r5 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
sure, what would you like doing? do you want them just excluding from the mean_het and het:hom ratio and adding to their own field?
Yeah, exclude from the I cannot really think of a way they are useful as their own separate group, but doing so does somewhat imply that they are excluded from the other group and aid understanding so let's go with that
src/vcf_qc.py
line 196 at r5 (raw file):
Previously, jethror1 (Jethro Rainford) wrote…
I was more thinking if an empty vcf goes in (i.e. a sample has virtually no reads), that it still outputs something so the job doesn't fail and cause downstream jobs to fail. I would hope we would pick this up in multiQC 🙂
will ditch the TODO and make sure an empty vcf works, probably should write some tests for main for this
I think we're agreed (though maybe I've not said what I meant in the clearest way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 17 of 21 files reviewed, 6 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)
src/vcf_qc.py
line 115 at r5 (raw file):
Previously, mattgarner wrote…
Then instead could the existing vars be renamed variants_count, autosomes_count etc, since they are counts of variants rather than variants
Done.
src/vcf_qc.py
line 146 at r5 (raw file):
Previously, mattgarner wrote…
Yeah, exclude from the I cannot really think of a way they are useful as their own separate group, but doing so does somewhat imply that they are excluded from the other group and aid understanding so let's go with that
now removed and added into the counts dict to be displayed, but not included in calculations
src/vcf_qc.py
line 196 at r5 (raw file):
Previously, mattgarner wrote…
I think we're agreed (though maybe I've not said what I meant in the clearest way)
added tests for main, including one to test an empty vcf passes through with no errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 4 files at r9, 2 of 3 files at r10, all commit messages.
Reviewable status: 21 of 22 files reviewed, 2 unresolved discussions (waiting on @Yu-jinKim)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r10.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jethror1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r11, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jethror1)
-u
to bedtools intersect to remove duplicate variants from overlapping bed file regionsThis change is