Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update handling of INFO/END when writing records in VCF #1201

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

rickymagner
Copy link

This PR is meant to resolve issue #1200. In particular, it changes the way the write method for VariantFile handles reasoning about when to write INFO/END or not. Previously, the code attempted to check to write this only when there were symbolic alleles, but ended up only writing this for insertions when asked for explicitly.

The code change now decides to exclude writing INFO/END if it's not present in the header, but will write in all cases when included in the header. This should allow users to update END values using record.stop, like in the following examples.


Start with example.vcf.gz as:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.14+htslib-1.14
##bcftools_viewCommand=view -o example.vcf.gz example.vcf; Date=Wed Jun 28 16:24:56 2023
##bcftools_viewCommand=view example.vcf.gz; Date=Thu Jun 29 10:49:17 2023
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
1	10000	.	GA	G	30	PASS	DP=14	GT	0|0	1|0	1/1
1	20000	.	T	TA	30	PASS	DP=11	GT	0|0	0|1	0/0

Here are two blocks of Python code run on it with their respective outputs:

with pysam.VariantFile('example.vcf.gz') as vcf:
    header = vcf.header
    with pysam.VariantFile('pysam-dev-test.vcf.gz', 'w', header=header) as out_vcf:
        for record in vcf:
            record.stop = record.pos + 1
            out_vcf.write(record)

In this case, the output matches example.vcf.gz despite editing the record.stop positions, because the END field is not defined in the header. However, this code block:

with pysam.VariantFile('example.vcf.gz') as vcf:
    header = vcf.header
    header.add_line(f'##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate in reference for SV">')
    with pysam.VariantFile('pysam-dev-test.vcf.gz', 'w', header=header) as out_vcf:
        for record in vcf:
            record.stop = record.pos + 1
            out_vcf.write(record)

produces the following output:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.14+htslib-1.14
##bcftools_viewCommand=view -o example.vcf.gz example.vcf; Date=Wed Jun 28 16:24:56 2023
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate in reference for SV">
##bcftools_viewCommand=view pysam-dev-test.vcf.gz; Date=Thu Jun 29 10:51:49 2023
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
1	10000	.	GA	G	30	PASS	DP=14;END=10001	GT	0|0	1|0	1/1
1	20000	.	T	TA	30	PASS	DP=11;END=20001	GT	0|0	0|1	0/0

So the user can control whether END should appear in the INFO fields by toggling whether it should be included in the header or not, and then access it via record.stop as usual. I think this makes more conceptual sense to check whether to print the field or not based on the header values. Since the sync method uses the same formula for determining the END coordinate, it should be consistent with the existing paradigm in the other field setters.

@nihalraman
Copy link

Hi Ricky, thanks for the explanation. Has there been any update on this?

@rickymagner
Copy link
Author

Hi, I think it's up to the maintainers to decide if they want to change this functionality according to the suggestions made here, so I leave it up to them to respond. I haven't heard anything since initially posting this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants