DI-809 refactor #9

jethror1 · 2024-09-20T08:50:16Z

refactor of app to Python
remove external dependencies, switch bedtools to using asset
add -u to bedtools intersect to remove duplicate variants from overlapping bed file regions
add unit tests
- coverage at 97%
ensure only autosomes used for mean het, mean hom and het:hom calculations
fix all the issues so calculations are actually correct
test job: https://platform.dnanexus.com/panx/projects/GqZ0Zp84bZv0G622qpqKf32y/monitor/job/GqfYjJQ4bZv102YgYZQ1xb23GqfYF5j4bZvPQ0kV2xZ1zX4y
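
For context on the `-u` change listed above: without `-u`, `bedtools intersect` reports a record from `-a` once per overlapping `-b` region, so a variant falling in two overlapping bed regions appears twice. With `-u` each `-a` record is reported at most once. A minimal sketch of how such a call might be built from Python; the function names and file paths are hypothetical and not taken from the app.

```python
import subprocess


def build_intersect_command(vcf, bed):
    """Build a bedtools intersect call that keeps each variant at most once."""
    return [
        "bedtools", "intersect",
        "-header",  # keep the VCF header in the output
        "-u",       # report each -a record once, however many -b regions overlap it
        "-a", vcf,
        "-b", bed,
    ]


def intersect_vcf(vcf, bed, out_path):
    """Run the intersect and write the result to out_path (needs bedtools on PATH)."""
    with open(out_path, "w") as out:
        subprocess.run(build_intersect_command(vcf, bed), stdout=out, check=True)
```

Keeping the command builder separate from the subprocess call means the arguments can be unit tested without bedtools installed.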

Yu-jinKim

Reviewed 20 of 20 files at r1.
Reviewable status: 18 of 20 files reviewed, 2 unresolved discussions (waiting on @jethror1 and @mattgarner)

src/vcf_qc.py line 7 at r1 (raw file):

import sys

if os.path.exists('/home/dnanexus'):

is that a pytest thing? i.e. you need to install packages in dnanexus mode but not in pytest mode because you have a step for that?
also it's disgusting having to install packages like that in dnanexus

src/vcf_qc.py line 73 at r1 (raw file):

    """
    autosomes = [
        x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y

brain hurt a bit trying to understand that

jethror1

Reviewable status: 18 of 20 files reviewed, 2 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)

src/vcf_qc.py line 7 at r1 (raw file):

Previously, Yu-jinKim wrote…

is that a pytest thing? i.e. you need to install packages in dnanexus mode but not in pytest mode because you have a step for that?
also it's disgusting having to install packages like that in dnanexus

It's really just a DNAnexus thing. The other option is to have it in execDepends in the dxapp.json, but that means it pulls from PyPI. When I asked support I got a worse answer, so I went with this, which works fine. It also means the app can still be run locally without installing the packages.
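
A rough sketch of the pattern being discussed, assuming the wheels are bundled with the app under a `packages` directory (that location and both helper names are hypothetical). The install only runs when the DNAnexus worker home directory exists, so local and pytest runs skip it.

```python
import glob
import os
import subprocess
import sys

DNANEXUS_HOME = "/home/dnanexus"


def running_in_dnanexus(home=DNANEXUS_HOME):
    """True when the DNAnexus worker home directory exists."""
    return os.path.exists(home)


def build_pip_command(wheels):
    """pip call that installs only the given local wheels, never contacting PyPI."""
    return [sys.executable, "-m", "pip", "install", "--no-index", "--no-deps", *wheels]


if running_in_dnanexus():
    # hypothetical location of wheels shipped alongside the app code
    wheels = sorted(glob.glob(os.path.join(DNANEXUS_HOME, "packages", "*.whl")))
    if wheels:
        subprocess.check_call(build_pip_command(wheels))
```

`--no-index` stops pip from reaching out to PyPI, which is the drawback of execDepends mentioned above.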

src/vcf_qc.py line 73 at r1 (raw file):

Previously, Yu-jinKim wrote…

brain hurt a bit trying to understand that

added a comment, it's just unpacking to a single list so each chromosome appears both with and without the chr prefix; otherwise it ends up as a list of tuples
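
To spell out what the comprehension quoted above does, here is the same flattening split into two steps with descriptive (hypothetical) variable names, checked against an `itertools` equivalent.

```python
from itertools import chain

# one (plain, prefixed) pair per autosome: ("1", "chr1"), ("2", "chr2"), ...
pairs = [(str(number), f"chr{number}") for number in range(1, 23)]

# nested comprehension: outer loop over the pairs, inner loop over each pair's names
autosomes = [name for pair in pairs for name in pair]

# the same flattening expressed with itertools
assert autosomes == list(chain.from_iterable(pairs))
```

The result is a flat list of 44 strings starting `"1", "chr1", "2", "chr2"`, rather than 22 tuples.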

mattgarner

Reviewed 12 of 20 files at r1, 1 of 1 files at r5.
Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)

src/vcf_qc.py line 73 at r1 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

added a comment, it's just unpacking to a single list so each chromosome appears both with and without the chr prefix; otherwise it ends up as a list of tuples

Maybe avoid the variable names x and y in the context of autosomes (when not intending to refer to chrX/chrY), for less brain hurt

src/vcf_qc.py line 74 at r5 (raw file):

    # build single list both with and without prefix
    autosomes = [
        x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y

Looks like we call this function for each variant assessed, meaning we generate an identical list over and over and over. Is there a better way whereby this list is generated once for all checks?

Code quote:

x for y in [(str(x), f"chr{x}") for x in range(1, 23)] for x in y

src/vcf_qc.py line 115 at r5 (raw file):

    }

    variants = autosomes = x_variants = 0

Since these are also counts, would it make sense for them to be in the counts dict?

Code quote:

variants = autosomes = x_variants = 0

src/vcf_qc.py line 121 at r5 (raw file):

        if not all(x in sample_fields for x in ['AD', 'DP', 'GT']):
            # TODO - do we still want to do this? does this even happen?

a vcf without these is possible. I guess the point was to check the fields we need are present and fail slightly earlier if not

Let's keep it, probably slightly easier for someone to interpret this than the error message that would come from a missing field later

Maybe the message itself could be a little more specific about which fields may be missing

Code quote:

            # TODO - do we still want to do this? does this even happen?

src/vcf_qc.py line 146 at r5 (raw file):

                x_variants += 1
                counts['x_hom'].append(non_ref_aaf)
        else:

Could we handle GT=0/1 (ref/alt1) vs GT=1/2 (alt1/alt2) differently?

The het ratio produced by this tool can give an indication of capture efficiency differences between ref and alt alleles, but if we include GT=1/2 in the mix then we're sometimes comparing alt vs alt, which dilutes that signal a bit

src/vcf_qc.py line 196 at r5 (raw file):

        # we don't have both het and hom variants => TODO figure out what to do
        # assume this is an empty vcf, and we'd just want to output something still?
        return ratios

If this happens something is drastically wrong and other qc detects it, so not too concerned about this. Setting to None makes sense

Code quote:

        # we don't have both het and hom variants => TODO figure out what to do
        # assume this is an empty vcf, and we'd just want to output something still?
        return ratios

mattgarner

Reviewed 5 of 20 files at r1.
Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)

jethror1

Reviewable status: 18 of 20 files reviewed, 7 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)

src/vcf_qc.py line 73 at r1 (raw file):

Previously, mattgarner wrote…

Maybe avoid the variable names x and y in the context of autosomes (when not intending to refer to chrX/chrY), for less brain hurt

ha yes, didn't think about that, have switched to i, j

src/vcf_qc.py line 74 at r5 (raw file):

Previously, mattgarner wrote…

Looks like we call this function for each variant assessed, meaning we generate an identical list over and over and over. Is there a better way whereby this list is generated once for all checks?

we could move the list generation into the calling function so it's only done once, then pass it in, but that makes testing worse. Alternatively, add a generate_autosomes() function to call, then have this take in the output from that, which would be fine I guess.

Since this is a fixed list of 44 entries (22 autosomes, each with and without the prefix), built as tuples and then flattened, I'm not overly concerned with performance vs the readability of keeping the context in a single function, but I appreciate the thought of not wanting needless inefficiency (especially if this was doing something more intense / API querying etc) 🙂

demonstrating the time for that list comp:

$ python3 -m timeit -u sec '[i for j in [(str(i), f"chr{i}") for i in range(1, 23)] for i in j]'
50000 loops, best of 5: 6.3e-06 sec per loop
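
For reference, one way the generate_autosomes() alternative mentioned above could look, as a sketch only. Caching the result means the names are built once per process, and the optional argument keeps the per-variant check easy to test. `is_autosome` is a hypothetical name, not a function from the PR.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def generate_autosomes():
    """Build the autosome names once; later calls return the cached frozenset."""
    return frozenset(
        name for number in range(1, 23) for name in (str(number), f"chr{number}")
    )


def is_autosome(chrom, autosomes=None):
    """Hypothetical per-variant check; tests can pass in their own set of names."""
    if autosomes is None:
        autosomes = generate_autosomes()
    return chrom in autosomes
```

A frozenset also makes the membership test constant time, though at this size the difference from a list is negligible, as the timing above suggests.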

src/vcf_qc.py line 115 at r5 (raw file):

Previously, mattgarner wrote…

Since these are also counts, would it make sense for them to be in the counts dict?

I could, I just wanted them there for a simple count at the end of the function, I don't use them anywhere else

src/vcf_qc.py line 121 at r5 (raw file):

Previously, mattgarner wrote…

a vcf without these is possible. I guess the point was to check the fields we need are present and fail slightly earlier if not

Let's keep it, probably slightly easier for someone to interpret this than the error message that would come from a missing field later

Maybe the message itself could be a little more specific about which fields may be missing

So print a warning and continue? I'll add that since it makes sense
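
A minimal sketch of a more specific message, assuming the sample's FORMAT keys are available as an iterable of strings. Both function names and the record identifier format are illustrative, not the app's actual code.

```python
REQUIRED_FIELDS = ("AD", "DP", "GT")


def missing_sample_fields(sample_fields, required=REQUIRED_FIELDS):
    """Return the required FORMAT fields that are absent from a record's sample."""
    return [field for field in required if field not in sample_fields]


def should_skip(record_id, sample_fields):
    """Warn, naming the missing fields, and tell the caller to skip the record."""
    missing = missing_sample_fields(sample_fields)
    if missing:
        print(f"WARNING: {record_id} is missing field(s) {', '.join(missing)}, skipping")
    return bool(missing)
```

Returning the list of missing fields, rather than a bare boolean from `all(...)`, is what lets the warning name them.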

src/vcf_qc.py line 146 at r5 (raw file):

Previously, mattgarner wrote…

Could we handle GT=0/1 (ref/alt1) vs GT=1/2 (alt1/alt2) differently?

The het ratio produced by this tool can give an indication of capture efficiency differences between ref and alt alleles, but if we include GT=1/2 in the mix then we're sometimes comparing alt vs alt, which dilutes that signal a bit

sure, what would you like doing? do you want them just excluding from the mean_het and het:hom ratio and adding to their own field?

src/vcf_qc.py line 196 at r5 (raw file):

Previously, mattgarner wrote…

If this happens something is drastically wrong and other qc detects it, so not too concerned about this. Setting to None makes sense

I was more thinking if an empty vcf goes in (i.e. a sample has virtually no reads), that it still outputs something so the job doesn't fail and cause downstream jobs to fail. I would hope we would pick this up in multiQC 🙂

will ditch the TODO and make sure an empty vcf works, probably should write some tests for main for this

mattgarner

Reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: 19 of 20 files reviewed, 6 unresolved discussions (waiting on @jethror1 and @Yu-jinKim)

src/vcf_qc.py line 115 at r5 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

I could, I just wanted them there for a simple count at the end of the function, I don't use them anywhere else

Then instead could the existing vars be renamed variants_count, autosomes_count etc, since they are counts of variants rather than variants

src/vcf_qc.py line 146 at r5 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

sure, what would you like doing? do you want them just excluding from the mean_het and het:hom ratio and adding to their own field?

Yeah, exclude them from the mean_het and het:hom calculations. I cannot really think of a way they are useful as their own separate group, but reporting them separately does somewhat imply that they are excluded from the other group and aids understanding, so let's go with that

src/vcf_qc.py line 196 at r5 (raw file):

Previously, jethror1 (Jethro Rainford) wrote…

I was more thinking if an empty vcf goes in (i.e. a sample has virtually no reads), that it still outputs something so the job doesn't fail and cause downstream jobs to fail. I would hope we would pick this up in multiQC 🙂

will ditch the TODO and make sure an empty vcf works, probably should write some tests for main for this

I think we're agreed (though maybe I've not said what I meant in the clearest way)

jethror1

Reviewable status: 17 of 21 files reviewed, 6 unresolved discussions (waiting on @mattgarner and @Yu-jinKim)

src/vcf_qc.py line 115 at r5 (raw file):

Previously, mattgarner wrote…

Then instead could the existing vars be renamed variants_count, autosomes_count etc, since they are counts of variants rather than variants

Done.

src/vcf_qc.py line 146 at r5 (raw file):

Previously, mattgarner wrote…

Yeah, exclude them from the mean_het and het:hom calculations. I cannot really think of a way they are useful as their own separate group, but reporting them separately does somewhat imply that they are excluded from the other group and aids understanding, so let's go with that

now removed and added into the counts dict to be displayed, but not included in calculations
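
To illustrate the distinction agreed here, a sketch of separating ref/alt hets (0/1) from alt/alt hets (1/2). It assumes GT arrives as a tuple of allele indices with None for a missing allele; the function and category names are hypothetical.

```python
def classify_genotype(gt):
    """Classify a diploid genotype given as allele indices, e.g. (0, 1).

    0 is the reference allele; any other integer is an alternate allele.
    """
    if len(gt) != 2 or None in gt:
        return "other"    # missing call or non-diploid
    first, second = gt
    if first == second:
        return "ref_hom" if first == 0 else "hom"
    if 0 in gt:
        return "het"      # ref/alt, e.g. 0/1
    return "alt_het"      # alt/alt, e.g. 1/2
```

Only the "het" and "hom" groups would then feed the mean het, mean hom and het:hom calculations, with "alt_het" counted for display.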

src/vcf_qc.py line 196 at r5 (raw file):

Previously, mattgarner wrote…

I think we're agreed (though maybe I've not said what I meant in the clearest way)

added tests for main, including one to test an empty vcf passes through with no errors
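
A sketch of the empty-VCF behaviour discussed in this thread, where values are set to None instead of raising. It assumes `counts` maps "het" and "hom" to lists of allele frequencies and that het:hom is the ratio of the two counts; both are assumptions, as are the key names and the rounding.

```python
from statistics import mean


def calculate_ratios(counts, ndigits=4):
    """Return mean het, mean hom and het:hom, using None where there is no data."""
    het, hom = counts.get("het", []), counts.get("hom", [])
    return {
        "mean_het": round(mean(het), ndigits) if het else None,
        "mean_hom": round(mean(hom), ndigits) if hom else None,
        # assumed definition: number of het variants over number of hom variants
        "het_hom_ratio": round(len(het) / len(hom), ndigits) if het and hom else None,
    }
```

With no variants at all every value is None, so the job can still write an output file and downstream jobs are not blocked.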

mattgarner

Reviewed 3 of 4 files at r9, 2 of 3 files at r10, all commit messages.
Reviewable status: 21 of 22 files reviewed, 2 unresolved discussions (waiting on @Yu-jinKim)

mattgarner

Reviewed 1 of 3 files at r10.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jethror1)

Yu-jinKim

Reviewed 1 of 1 files at r11, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @jethror1)

jethror1 added 30 commits September 13, 2024 10:07

update dxapp.json

f9687f0

remove bedtools tar

5ce2a82

remove shell script

bc1b432

move python app code

bc38d05

add .gitignore

191fd68

add functions to get variant counts and calculate ratios

9f2e35a

formatting

fb1998d

more comments

db7ec5c

remove comment

3d9f4ec

update readme

c437920

fix formatting

d211dc3

update readme

23aa426

add uploading of output file

b72ee3f

add bedtools dependency asset

fb91092

add comments

18fa0ce

remove default file for bed file

3d8e20e

add developers and authorizedUsers to dxapp.json

f9ba5a5

add version to dxapp.json

ef462c6

update entrypoint in dxapp.json

7147850

add ubuntu version to dxapp.json

7e921f0

add rounding

97d8b84

update interpreter in dxapp.json

24800a1

actually download the input files

d6423ca

update .gitignore

7eca8fb

add pysam wheel

d9cec7b

add requirements.txt

10b991c

add test setup

354956b

skip ref hom sites

5950946

bonus upload print

5221ac4

update pytest workflow

8bf1c93

jethror1 added 3 commits September 20, 2024 09:58

improve log of record

b20429a

fix to allow running locally and in DNAnexus

fb926e1

fix pep8 issues

88e58a2

Yu-jinKim requested changes Sep 20, 2024

jethror1 added 2 commits September 20, 2024 10:11

improve prints

48eed12

add comment

ef94e44

jethror1 commented Sep 20, 2024

jethror1 added 2 commits September 20, 2024 10:17

update output field help

ae857d9

Black reformat

c320619

mattgarner requested changes Sep 20, 2024

mattgarner reviewed Sep 20, 2024

remove x var confusion

dbf1865

jethror1 commented Sep 20, 2024

improve logging of variants with missing fields

f9d363f

mattgarner requested changes Sep 20, 2024

jethror1 added 6 commits September 20, 2024 13:19

add unit tests for main

b1a68d5

add empty vcf

1d424bd

add additional unit tests for main

963950f

improve counter

5db01f1

separate alt het variants out

d766f84

update readme

fea204b

jethror1 commented Sep 20, 2024

jethror1 added 3 commits September 20, 2024 14:24

add alt het variant to test.vcf

429be63

update tests for alt het variant

d872a7f

add tests for skipping variants with missing values

6224a32

mattgarner reviewed Sep 20, 2024

mattgarner approved these changes Sep 20, 2024

fix readme typo

20aa472

Yu-jinKim approved these changes Sep 23, 2024

Yu-jinKim merged commit 927ffb1 into master Sep 23, 2024
3 checks passed
