Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port to Python 3 and other changes #25

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

peterk87
Copy link

@peterk87 peterk87 commented Sep 4, 2020

Hello,

I was recently trying to use Genotyphi for subtyping S. Typhi genomes and encountered some issues integrating the tool with the rest of the tools in a workflow, so I forked and refactored the code to be compatible with Python 3.7+ and to understand it a bit better.

I hope the changes help the project.

Thanks,
Peter

Changes:

  • conversion to Python 3 compatible code with 2to3
  • Python TempDirectory for BAM to VCF
  • Python subprocess piping of output from samtools mpileup to bcftools call to reduce disk IO
  • Converted print statements to console logging statements
  • added some type hints using the Python typing module
  • fixed line endings for Linux (initially I couldn't get the script to execute due to an issue with \r characters)
  • tabs to spaces (as recommended in PEP8)
  • BioPython for FASTA parsing with SimpleFastaParser
  • default write output to standard output so users can run $genotyphi --args ... > genotyphi-output.txt or pipe it into other programs
  • removed prepending of timestamp to output file path which could cause issues when users provide an absolute file path for output
  • refactoring for readability (extracted methods, f-string literals)

Also, one thing I noticed is that multiple lists are used to describe linked data where one dict of tuples could be used instead, e.g.

loci = [1, 2, 3, ...]
snp_alleles = ['A', 'G', 'T', 'C', ...]
groups = ['0.1', '0.0.1', ...]

The above could be converted to the following:

snp_markers = {1: ('A', '0.1'), 2: ('G', '0.0.1'), ...}

This change would make it easier to add/update loci/groups/alleles, and when checking if a variant from a VCF file is one of the genotyping markers, search for the key in a snp_markers dict would be in constant time rather than linear time with loci.index(variant_location), e.g.

def checkSNP(vcf_line_split, this_groups, proportions, args):
    location = int(vcf_line_split[1])
    try:
        snp_allele, group = snp_markers[location]
        ...
    except KeyError:
        pass
    return this_groups, proportions

@katholt
Copy link
Collaborator

katholt commented Sep 5, 2020

Thanks Peter. Converting to py3 is something we know needs to be done, but have been putting off doing ourselves as it was not needed for our own workflows and we have had other priorities. So - we very much appreciate you making and sharing these changes.
We will review & test the code and merge if all looks good.

Re the use of dictionary - of course you are right, and the current setup is a legacy from early code development that we never went back to optimise. We are actually planning to change the code to read the genotype info into a dict from a simple text file, to make it as easy as possible to change/add new groups and also to see/reuse the genotype definitions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants