How do you add ancestral state to SampleData object? #545
-
I tried adding the ancestral allele .fa file to a multiple sample phased vcf file using this:
but I get errors. There may be something wrong with my files causing Beagle to not phase the whole file. I don't know yet, but are there other ways the ancestral allele can be added in the pipeline? I was considering creating the tsinfer SampleData object and using the add_site method to add the ancestral alleles from the .fa file to the allele argument in the method. Do you have a function or method for this or would this be something that I would need to create? |
Beta Was this translation helpful? Give feedback.
Replies: 4 comments 8 replies
-
@awohns wrote some code to do this, I think, but it might be helpful to post demo code here. The main problem is going to be to ensure that you are using the same chromosome build in the VCF as the fasta file, because any function we write here might not be able to check. Perhaps we should at least check that the chromosome lengths are the same in the VCF and the fasta? The code we used is here: https://github.com/mcveanlab/treeseq-inference/blob/0cbbb062c96ad4433d8b4d0f120f93ac2d985345/human-data/convert.py#L581 coupled with the function at https://github.com/mcveanlab/treeseq-inference/blob/0cbbb062c96ad4433d8b4d0f120f93ac2d985345/human-data/convert.py#L89. It uses the pysam library. I guess you could steal a few lines from there to put into a script? |
Beta Was this translation helpful? Give feedback.
-
Since it seems to be taking a long time to read in your data, @lakishadavid, note that there is some example code which does so in parallel from a VCF file. I have't adjusted it to use FASTA for the ancestral state, but I guess that's possible, since sam tools makes a random access index for the FASTA. Here's the link to the code: |
Beta Was this translation helpful? Give feedback.
-
Thanks, @hyanwong and @jeromekelleher. The above code works after revising it to skip At first, I reduced my data vcf file to the header and chromosome 1 variants. Then I completed a few rounds of running the script (unsuccessfully) and removing variants indicated by the error output ending in
I was then able to run the I can now take what I've learned plus additional readings of the documentation and examples to refine my script. It was key to see @jeromekelleher 's code on getting all the ancestral states because I'm sure I would have missed some important details explained in the commented lines. Ordering the genotypes based on this information was an unexpected challenge that I didn't recognize until after I was able to add the alleles. |
Beta Was this translation helpful? Give feedback.
-
Note that specifying the ancestral state in the new VariantData interface (alpha version) is now easy, so I'm closing this. |
Beta Was this translation helpful? Give feedback.
@awohns wrote some code to do this, I think, but it might be helpful to post demo code here. The main problem is going to be to ensure that you are using the same chromosome build in the VCF as the fasta file, because any function we write here might not be able to check. Perhaps we should at least check that the chromosome lengths are the same in the VCF and the fasta?
The code we used is here: https://github.com/mcveanlab/treeseq-inference/blob/0cbbb062c96ad4433d8b4d0f120f93ac2d985345/human-data/convert.py#L581 coupled with the function at https://github.com/mcveanlab/treeseq-inference/blob/0cbbb062c96ad4433d8b4d0f120f93ac2d985345/human-data/convert.py#L89. It uses the pysam library. I …