Port to Python 3 and other changes #25

peterk87 · 2020-09-04T20:48:01Z

Hello,

I was recently trying to use Genotyphi for subtyping S. Typhi genomes and encountered some issues integrating the tool with the rest of the tools in a workflow, so I forked and refactored the code to be compatible with Python 3.7+ and to understand it a bit better.

I hope the changes help the project.

Thanks,
Peter

Changes:

conversion to Python 3 compatible code with 2to3
Python TempDirectory for BAM to VCF
Python subprocess piping of output from samtools mpileup to bcftools call to reduce disk IO
Converted print statements to console logging statements
added some type hints using the Python typing module
fixed line endings for Linux (initially I couldn't get the script to execute due to an issue with \r characters)
tabs to spaces (as recommended in PEP8)
BioPython for FASTA parsing with SimpleFastaParser
default write output to standard output so users can run $genotyphi --args ... > genotyphi-output.txt or pipe it into other programs
removed prepending of timestamp to output file path which could cause issues when users provide an absolute file path for output
refactoring for readability (extracted methods, f-string literals)

Also, one thing I noticed is that multiple lists are used to describe linked data where one dict of tuples could be used instead, e.g.

loci = [1, 2, 3, ...]
snp_alleles = ['A', 'G', 'T', 'C', ...]
groups = ['0.1', '0.0.1', ...]

The above could be converted to the following:

snp_markers = {1: ('A', '0.1'), 2: ('G', '0.0.1'), ...}

This change would make it easier to add/update loci/groups/alleles, and when checking if a variant from a VCF file is one of the genotyping markers, search for the key in a snp_markers dict would be in constant time rather than linear time with loci.index(variant_location), e.g.

def checkSNP(vcf_line_split, this_groups, proportions, args):
    location = int(vcf_line_split[1])
    try:
        snp_allele, group = snp_markers[location]
        ...
    except KeyError:
        pass
    return this_groups, proportions

pipe output of mpileup to bcftools call setup logging general refactoring

Added changes from typhoidgenomics@55eb40b

katholt · 2020-09-05T11:17:48Z

Thanks Peter. Converting to py3 is something we know needs to be done, but have been putting off doing ourselves as it was not needed for our own workflows and we have had other priorities. So - we very much appreciate you making and sharing these changes.
We will review & test the code and merge if all looks good.

Re the use of dictionary - of course you are right, and the current setup is a legacy from early code development that we never went back to optimise. We are actually planning to change the code to read the genotype info into a dict from a simple text file, to make it as easy as possible to change/add new groups and also to see/reuse the genotype definitions.

peterk87 added 6 commits August 20, 2020 09:20

2to3-3.7 conversion of genotyphi.py

7ee9698

PyCharm auto formatting and suggested edits

c126249

fix line endings with vim ':set ff=unix'

84f3ddd

samtools mpileup and bcftools call in tmp dir

ca0d42b

pipe output of mpileup to bcftools call setup logging general refactoring

fix snp count

8cd0893

Updated to be inline with upstream

3b5b080

Added changes from typhoidgenomics@55eb40b

katholt assigned zadyson Sep 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port to Python 3 and other changes #25

Port to Python 3 and other changes #25

peterk87 commented Sep 4, 2020

katholt commented Sep 5, 2020

Port to Python 3 and other changes #25

Are you sure you want to change the base?

Port to Python 3 and other changes #25

Conversation

peterk87 commented Sep 4, 2020

katholt commented Sep 5, 2020