Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DI-809 refactor #9

Merged
merged 65 commits into from
Sep 23, 2024
Merged

DI-809 refactor #9

merged 65 commits into from
Sep 23, 2024

Conversation

jethror1
Copy link
Contributor

@jethror1 jethror1 commented Sep 20, 2024


This change is Reviewable

Copy link
Contributor

@Yu-jinKim Yu-jinKim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 20 of 20 files at r1.
Reviewable status: 18 of 20 files reviewed, 2 unresolved discussions (waiting on @jethror1 and @mattgarner)


src/vcf_qc.py line 7 at r1 (raw file):

import sys

if os.path.exists('/home/dnanexus'):

is that a pytest thing? i.e. you need to install packages in dnanexus mode but not in pytest mode because you have a step for that?
also it's disgusting having to install packages like that in dnanexus


src/vcf_qc.py line 73 at r1 (raw file):

    """
    autosomes = [
        x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y

brain hurt a bit trying to understand that

Copy link
Contributor Author

@jethror1 jethror1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 18 of 20 files reviewed, 2 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)


src/vcf_qc.py line 7 at r1 (raw file):

Previously, Yu-jinKim wrote…

is that a pytest thing? i.e. you need to install packages in dnanexus mode but not in pytest mode because you have a step for that?
also it's disgusting having to install packages like that in dnanexus

its really just a DNAnexus thing, the other option is to have it in execDepends in the dxapp.json, but that means it pulls from pypi. When I asked support I got a worse answer so went with this which works fine. Also allows to keep running it locally too and not install packages


src/vcf_qc.py line 73 at r1 (raw file):

Previously, Yu-jinKim wrote…

brain hurt a bit trying to understand that

added a comment, its just unpacking to a single list to have both with and without the chr prefix, else it ends up a list of tuples

Copy link
Contributor

@mattgarner mattgarner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 20 files at r1, 1 of 1 files at r5.
Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)


src/vcf_qc.py line 73 at r1 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

added a comment, its just unpacking to a single list to have both with and without the chr prefix, else it ends up a list of tuples

maybe not using variable names x and y in context of autosomes (when not intending to refer to chrX/chrY) is a good idea for less brain hurt


src/vcf_qc.py line 74 at r5 (raw file):

    # build single list both with and without prefix
    autosomes = [
        x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y

Looks like we call this function for each variant assessed, meaning we generate an identical list over and over and over. Is there a better way whereby this list is generated once for all checks?

Code quote:

x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y

src/vcf_qc.py line 115 at r5 (raw file):

    }

    variants = autosomes = x_variants = 0

Since these are also counts, would it make sense for them to be in the counts dict?

Code quote:

variants = autosomes = x_variants = 0

src/vcf_qc.py line 121 at r5 (raw file):

        if not all(x in sample_fields for x in ['AD', 'DP', 'GT']):
            # TODO - do we still want to do this? does this even happen?

a vcf without these is possible. I guess the point was to check the fields we need are present and fail slightly earlier if not

Let's keep it, probably slightly easier for someone to interpret this than the error message that would come from a missing field later

Maybe the message itself could be a little more specific about which fields may be missing

Code quote:

            # TODO - do we still want to do this? does this even happen?

src/vcf_qc.py line 146 at r5 (raw file):

                x_variants += 1
                counts['x_hom'].append(non_ref_aaf)
        else:

Could we handle GT=0/1 (ref/alt1) vs GT=1/2 (alt1/alt2) differently?

This het ratio produced by this tool can give an indication of capture efficiency differences between ref vs alt alleles, but if we include GT=1/2 in the mix then we're sometimes comparing alt vs alt and so diluting that signal a bit

Code quote:

e

src/vcf_qc.py line 196 at r5 (raw file):

        # we don't have both het and hom variants => TODO figure out what to do
        # assume this is an empty vcf, and we'd just want to output something still?
        return ratios

If this happens something is drastically wrong and other qc detects it, so not too concerned about this. Setting to None makes sense

Code quote:

        # we don't have both het and hom variants => TODO figure out what to do
        # assume this is an empty vcf, and we'd just want to output something still?
        return ratios

Copy link
Contributor

@mattgarner mattgarner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 20 files at r1.
Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)

Copy link
Contributor Author

@jethror1 jethror1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)


src/vcf_qc.py line 73 at r1 (raw file):

Previously, mattgarner wrote…

maybe not using variable names x and y in context of autosomes (when not intending to refer to chrX/chrY) is a good idea for less brain hurt

ha yes, didn't think about that, have switched to i, j


src/vcf_qc.py line 74 at r5 (raw file):

Previously, mattgarner wrote…

Looks like we call this function for each variant assessed, meaning we generate an identical list over and over and over. Is there a better way whereby this list is generated once for all checks?

we could remove the list generating into the calling function so it's only once, then pass it in but that makes testing worse. Alternatively add a generate_autosomes() function to call, then have this take in the output from that which would be fine I guess.

Since this is a defined list of 24 being generated, then iterating over to split the tuples I'm not overly concerned with performance vs readability of keeping the context in single function, but I appreciate the thought of not wanting needless inefficiency (especially if this was doing something more intense / API querying etc) 🙂

demonstrating the time for that list comp:

$ python3 -m timeit -u sec '[i for j in [(str(i), f"chr{i}") for i in range(1, 23)] for i in j]'
50000 loops, best of 5: 6.3e-06 sec per loop

src/vcf_qc.py line 115 at r5 (raw file):

Previously, mattgarner wrote…

Since these are also counts, would it make sense for them to be in the counts dict?

I could, I just wanted them there for a simple count at the end of the function, I don't use them anywhere else


src/vcf_qc.py line 121 at r5 (raw file):

Previously, mattgarner wrote…

a vcf without these is possible. I guess the point was to check the fields we need are present and fail slightly earlier if not

Let's keep it, probably slightly easier for someone to interpret this than the error message that would come from a missing field later

Maybe the message itself could be a little more specific about which fields may be missing

so print a warning and continue? Shall add that since it makes sense


src/vcf_qc.py line 146 at r5 (raw file):

Previously, mattgarner wrote…

Could we handle GT=0/1 (ref/alt1) vs GT=1/2 (alt1/alt2) differently?

This het ratio produced by this tool can give an indication of capture efficiency differences between ref vs alt alleles, but if we include GT=1/2 in the mix then we're sometimes comparing alt vs alt and so diluting that signal a bit

sure, what would you like doing? do you want them just excluding from the mean_het and het:hom ratio and adding to their own field?


src/vcf_qc.py line 196 at r5 (raw file):

Previously, mattgarner wrote…

If this happens something is drastically wrong and other qc detects it, so not too concerned about this. Setting to None makes sense

I was more thinking if an empty vcf goes in (i.e. a sample has virtually no reads), that it still outputs something so the job doesn't fail and cause downstream jobs to fail. I would hope we would pick this up in multiQC 🙂

will ditch the TODO and make sure an empty vcf works, probably should write some tests for main for this

Copy link
Contributor

@mattgarner mattgarner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: 19 of 20 files reviewed, 6 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)


src/vcf_qc.py line 115 at r5 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

I could, I just wanted them there for a simple count at the end of the function, I don't use them anywhere else

Then instead could the existing vars be renamed variants_count, autosomes_count etc, since they are counts of variants rather than variants


src/vcf_qc.py line 146 at r5 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

sure, what would you like doing? do you want them just excluding from the mean_het and het:hom ratio and adding to their own field?

Yeah, exclude from the I cannot really think of a way they are useful as their own separate group, but doing so does somewhat imply that they are excluded from the other group and aid understanding so let's go with that


src/vcf_qc.py line 196 at r5 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

I was more thinking if an empty vcf goes in (i.e. a sample has virtually no reads), that it still outputs something so the job doesn't fail and cause downstream jobs to fail. I would hope we would pick this up in multiQC 🙂

will ditch the TODO and make sure an empty vcf works, probably should write some tests for main for this

I think we're agreed (though maybe I've not said what I meant in the clearest way)

Copy link
Contributor Author

@jethror1 jethror1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 17 of 21 files reviewed, 6 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)


src/vcf_qc.py line 115 at r5 (raw file):

Previously, mattgarner wrote…

Then instead could the existing vars be renamed variants_count, autosomes_count etc, since they are counts of variants rather than variants

Done.


src/vcf_qc.py line 146 at r5 (raw file):

Previously, mattgarner wrote…

Yeah, exclude from the I cannot really think of a way they are useful as their own separate group, but doing so does somewhat imply that they are excluded from the other group and aid understanding so let's go with that

now removed and added into the counts dict to be displayed, but not included in calculations


src/vcf_qc.py line 196 at r5 (raw file):

Previously, mattgarner wrote…

I think we're agreed (though maybe I've not said what I meant in the clearest way)

added tests for main, including one to test an empty vcf passes through with no errors

Copy link
Contributor

@mattgarner mattgarner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r9, 2 of 3 files at r10, all commit messages.
Reviewable status: 21 of 22 files reviewed, 2 unresolved discussions (waiting on @Yu-jinKim)

Copy link
Contributor

@mattgarner mattgarner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r10.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @jethror1)

Copy link
Contributor

@Yu-jinKim Yu-jinKim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r11, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @jethror1)

@Yu-jinKim Yu-jinKim merged commit 927ffb1 into master Sep 23, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants