How do you add ancestral state to SampleData object? #545

lakishadavid · 2021-06-19T07:59:01Z

I tried adding ancestral alleles from a .fa file to a multi-sample phased VCF file using this:

bcftools +fill-from-fasta PhasedSamples.gt.vcf.gz -- -c AA -f homo_sapiens_ancestor_1.fa --header-lines PhasedSamples_header.txt

but I get errors. There may be something wrong with my files that is causing Beagle not to phase the whole file; I don't know yet. Are there other ways to add the ancestral allele in the pipeline? I was considering creating the tsinfer SampleData object and using the add_site method, passing the ancestral alleles from the .fa file in the method's alleles argument. Do you have a function or method for this, or is this something I would need to write myself?

hyanwong · 2021-06-19T11:25:38Z

Collaborator

@awohns wrote some code to do this, I think, but it might be helpful to post demo code here. The main problem is going to be to ensure that you are using the same chromosome build in the VCF as the fasta file, because any function we write here might not be able to check. Perhaps we should at least check that the chromosome lengths are the same in the VCF and the fasta?

The code we used is here: https://github.com/mcveanlab/treeseq-inference/blob/0cbbb062c96ad4433d8b4d0f120f93ac2d985345/human-data/convert.py#L581 coupled with the function at https://github.com/mcveanlab/treeseq-inference/blob/0cbbb062c96ad4433d8b4d0f120f93ac2d985345/human-data/convert.py#L89. It uses the pysam library. I guess you could steal a few lines from there to put into a script?
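
One way to sketch the sanity check suggested above is to compare per-chromosome lengths from the two files before reading any ancestral states. This is only an illustration, not code from the linked script: the helper names are made up, and the pysam-based loader assumes that the VCF header declares contig lengths and that the caller supplies a mapping from VCF contig names to FASTA record names (the two files often name chromosomes differently).

```python
def find_length_mismatches(vcf_lengths, fasta_lengths):
    """Return {contig: (vcf_length, fasta_length)} for contigs whose lengths differ.

    Both arguments are plain dicts of contig name -> length. Contigs that are
    missing from the FASTA dict, or whose VCF length is None, are ignored.
    """
    mismatches = {}
    for name, vcf_len in vcf_lengths.items():
        fasta_len = fasta_lengths.get(name)
        if vcf_len is None or fasta_len is None:
            continue
        if vcf_len != fasta_len:
            mismatches[name] = (vcf_len, fasta_len)
    return mismatches


def load_lengths(vcf_path, fasta_path, fasta_name_for):
    """Read contig lengths with pysam (hypothetical helper).

    fasta_name_for maps a VCF contig name to the matching FASTA record name.
    """
    import pysam  # imported lazily so the comparison above has no dependencies

    with pysam.VariantFile(vcf_path) as vcf:
        vcf_lengths = {
            name: contig.length for name, contig in vcf.header.contigs.items()
        }
    with pysam.FastaFile(fasta_path) as fasta:
        by_record = dict(zip(fasta.references, fasta.lengths))
    fasta_lengths = {
        vcf_name: by_record[fasta_name]
        for vcf_name, fasta_name in fasta_name_for.items()
        if fasta_name in by_record
    }
    return vcf_lengths, fasta_lengths
```

Matching lengths do not prove that both files use the same build, but a mismatch is a cheap early warning that they do not.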

6 replies

lakishadavid Jun 21, 2021
Author

Code

import tskit
import tsinfer
import attr
import cyvcf2
import pysam
import tqdm
import allel
import zarr
import re
import numpy as np
import pandas as pd

# convert the multi individual vcf file to zarr format
allel.vcf_to_zarr('/results/PhasedSamples.gt.vcf.gz', '/results/PhasedSamples.zarr.gz', fields='*', overwrite=True)
callset = zarr.open_group('/results/PhasedSamples.zarr.gz', mode='r')
print(callset.tree(expand = True))

Output from print command

/
 ├── calldata
 │   ├── DS (1019076, 60, 3) float32
 │   └── GT (1019076, 60, 2) int8
 ├── samples (60,) object
 └── variants
     ├── AF (1019076, 3) float32
     ├── ALT (1019076, 3) object
     ├── CHROM (1019076,) object
     ├── DR2 (1019076, 3) float32
     ├── FILTER_PASS (1019076,) bool
     ├── ID (1019076,) object
     ├── IMP (1019076,) bool
     ├── POS (1019076,) int32
     ├── QUAL (1019076,) float32
     ├── REF (1019076,) object
     ├── altlen (1019076, 3) int32
     ├── is_snp (1019076,) bool
     └── numalt (1019076,) int32

Code continued

# get the individual names for the SampleData add_individual metadata
individual_names = []
pattern = r'/samples/'
for item in list(callset.samples):
    mod_string = re.sub(pattern, '', item )
    individual_names.append(str(mod_string))

# method to get all ancestral states

ancestral_states_file = "/references/homo_sapiens_ancestor_GRCh38/homo_sapiens_ancestor_autosomes.fa"
fasta = pysam.FastaFile(ancestral_states_file)

# NB! We put in an extra character at the start to convert to 1 based coords.
ancestral_states = "X" + fasta.fetch(reference = fasta.references[0])

# The largest possible site position is len(ancestral_states). Positions must be strictly less than sequence_length, so we add 1.
sequence_length = len(ancestral_states) + 1

# method to get the population index for an individual

# sample csv file with individual's Name and Population assignments
df = pd.read_csv('/samples/samples_dataframe.csv', names = ['Name', 'Population'])

def get_pop_index(person):
    tsinfer_population_index = {
        "Unassigned": 0,
        "Paga Ghana": 1,
        "Burkina Faso": 2,
        "Fante": 3,
        "African American": 4,
        "Northern Ghana, other": 5,
        "African, other": 6,
        "African Descent, other": 7,
    }

    person = df.loc[df['Name'] == person]
    pop_search = person.iloc[0]['Population']
    pop_index = tsinfer_population_index[pop_search]
    return pop_index

# method to get the ancestral state for each site

def get_ancestral_state(POS):
    # Return the ancestral base at a 1-based position, or None if the
    # position is past the end of the sequence or the base is not an
    # uppercase A, C, G or T
    try:
        ancestral_state = ancestral_states[POS]
    except IndexError:
        return None

    if ancestral_state in ["A", "C", "T", "G"]:
        return ancestral_state
    return None

# method to get the new genotype order,
# referencing the alleles as [ancestral state, REF].
# use [ancestral state, ALT] if ancestral state and REF are the same
# this works but needs additional error handling and to deal with condition when ALT is Null
def get_genotypes(GT, REF, ALT, ancestral_state):
    # flatten the (individuals, ploidy) calls into one entry per haplotype
    genotype = np.array(GT).flatten()

    allele1 = ancestral_state

    if REF == ancestral_state:
        # route 1: REF is ancestral, so the VCF coding can be used as-is
        allele2 = ALT[0]
        genotypes = genotype
        note = "route 1"
    elif ALT[0] == ancestral_state:
        # route 2: ALT is ancestral, so swap the 0 and 1 codes
        allele2 = REF
        swap = {0: 1, 1: 0}
        genotypes = [swap[item] for item in genotype]
        note = "route 2"
    else:
        # route 3: the ancestral state is neither REF nor ALT, so REF calls
        # become the derived allele (1) and ALT calls are marked as missing
        allele2 = REF
        genotypes = [1 if item == 0 else tskit.MISSING_DATA for item in genotype]
        note = "route 3"

    return genotypes, allele1, allele2, note

# create the tsinfer SampleData object
progress = tqdm.tqdm(total=len(callset.variants.ID))
with tsinfer.SampleData(path = '/results/project_data.samples') as samples:

    # Define populations
    samples.add_population(metadata={"name": "Unassigned"})
    samples.add_population(metadata={"name": "Paga Ghana"})
    samples.add_population(metadata={"name": "Burkina Faso"})
    samples.add_population(metadata={"name": "Fante"})
    samples.add_population(metadata={"name": "African American"})
    samples.add_population(metadata={"name": "Northern Ghana, other"})
    samples.add_population(metadata={"name": "African, other"})
    samples.add_population(metadata={"name": "African Descent, other"})

    # Define individuals
    for person in individual_names:
    
        try:
            samples.add_individual(
                ploidy = 2, 
                population = get_pop_index(person), 
                metadata = {"name": person}
            )
        except:
            print(person)

    # Define sites and genotypes
    for number, site in enumerate(callset.variants.ID):
        
        POS = callset.variants.POS[number]
        REF = callset.variants.REF[number]
        ALT = callset.variants.ALT[number]
        GT = callset.calldata.GT[number]
        
        if ALT[1] == "":
            ancestral_state = get_ancestral_state(POS)
            if ancestral_state:
                genotypes, allele1, allele2, note = get_genotypes(GT, REF, ALT, ancestral_state)
           
                try:
                    samples.add_site(
                        position = POS,
                        genotypes = genotypes,
                        alleles = np.append(allele1, allele2)
                    )

                except:
                    pass
                
            else:
                continue
        else:
            continue
        progress.update()
        
    progress.close()

print(
    "Sample file created for {} samples ".format(samples.num_samples)
    + "({} individuals) ".format(samples.num_individuals)
    + "with {} variable sites.".format(samples.num_sites),
    flush=True,
)
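
The allele reordering done by get_genotypes above can also be written as a single mapping step from old allele codes to new ones. The following is a minimal sketch of that idea rather than a drop-in replacement: the function name recode_site is made up for illustration, it takes a flat list of haplotype calls and a single ALT allele, and it uses -1 for missing data, which is the value of tskit.MISSING_DATA.

```python
MISSING = -1  # same value as tskit.MISSING_DATA


def recode_site(genotypes, ref, alt, ancestral):
    """Return (alleles, genotypes) with the ancestral allele at index 0.

    genotypes is a flat sequence of VCF allele codes (0 = REF, 1 = ALT,
    negative = missing), one entry per haplotype.
    """
    old_alleles = [ref, alt]
    if ancestral == ref:
        new_alleles = [ref, alt]
    elif ancestral == alt:
        new_alleles = [alt, ref]
    else:
        # The ancestral state matches neither allele: keep REF as the derived
        # allele and let ALT calls fall through to missing data.
        new_alleles = [ancestral, ref]

    # Map each allele string to its position in the new allele list.
    index_of = {allele: i for i, allele in enumerate(new_alleles)}

    recoded = []
    for code in genotypes:
        if code < 0:
            recoded.append(MISSING)
        else:
            recoded.append(index_of.get(old_alleles[code], MISSING))
    return new_alleles, recoded
```

Because the alleles come back as a plain Python list, the result can be passed to add_site without any NumPy conversion.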

lakishadavid Jun 21, 2021
Author

However, this only gets as far as 76%|███████▌ | 771978/1019076 [5:29:52<1:45:35, 39.00it/s] and then exits with TypeError: Object of type ndarray is not JSON serializable.

TypeError

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-80-2971eae734f3> in <module>
     81         progress.update()
     82 
---> 83     progress.close()

~/.local/lib/python3.8/site-packages/tsinfer/formats.py in __exit__(self, *args)
    422     def __exit__(self, *args):
    423         if self._mode != self.READ_MODE:
--> 424             self.finalise()
    425         elif self.path is not None:
    426             self.close()

~/.local/lib/python3.8/site-packages/tsinfer/formats.py in finalise(self)
   1823                 self._samples_writer.flush()
   1824             elif self._build_state == self.ADDING_SITES:
-> 1825                 self._sites_writer.flush()
   1826             if self.num_sites == 0:
   1827                 raise ValueError("Must add at least one site")

~/.local/lib/python3.8/site-packages/tsinfer/formats.py in flush(self)
    233         It is an error to call ``add`` after ``flush`` has been called.
    234         """
--> 235         self._queue_flush_buffer()
    236         # Stop the the worker threads.
    237         for _ in range(self.num_threads):

~/.local/lib/python3.8/site-packages/tsinfer/formats.py in _queue_flush_buffer(self)
    207         else:
    208             logger.debug("Syncronously flushing buffer")
--> 209             self._commit_write_buffer(self.write_buffer)
    210         self.num_buffered_items[self.write_buffer] = 0
    211         self.start_offset[self.write_buffer] = self.total_items

~/.local/lib/python3.8/site-packages/tsinfer/formats.py in _commit_write_buffer(self, write_buffer)
    179         for key, array in self.arrays.items():
    180             buffered = self.buffers[key][write_buffer][:n]
--> 181             array[start:end] = buffered
    182         logger.debug(f"Buffer {write_buffer} flush done")
    183 

~/.local/lib/python3.8/site-packages/zarr/core.py in __setitem__(self, selection, value)
   1211 
   1212         fields, selection = pop_fields(selection)
-> 1213         self.set_basic_selection(selection, value, fields=fields)
   1214 
   1215     def set_basic_selection(self, selection, value, fields=None):

~/.local/lib/python3.8/site-packages/zarr/core.py in set_basic_selection(self, selection, value, fields)
   1306             return self._set_basic_selection_zd(selection, value, fields=fields)
   1307         else:
-> 1308             return self._set_basic_selection_nd(selection, value, fields=fields)
   1309 
   1310     def set_orthogonal_selection(self, selection, value, fields=None):

~/.local/lib/python3.8/site-packages/zarr/core.py in _set_basic_selection_nd(self, selection, value, fields)
   1597         indexer = BasicIndexer(selection, self)
   1598 
-> 1599         self._set_selection(indexer, value, fields=fields)
   1600 
   1601     def _set_selection(self, indexer, value, fields=None):

~/.local/lib/python3.8/site-packages/zarr/core.py in _set_selection(self, indexer, value, fields)
   1649 
   1650                 # put data
-> 1651                 self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
   1652         else:
   1653             lchunk_coords, lchunk_selection, lout_selection = zip(*indexer)

~/.local/lib/python3.8/site-packages/zarr/core.py in _chunk_setitem(self, chunk_coords, chunk_selection, value, fields)
   1886 
   1887         with lock:
-> 1888             self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
   1889                                        fields=fields)
   1890 

~/.local/lib/python3.8/site-packages/zarr/core.py in _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields)
   1891     def _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields=None):
   1892         ckey = self._chunk_key(chunk_coords)
-> 1893         cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
   1894         # store
   1895         self.chunk_store[ckey] = cdata

~/.local/lib/python3.8/site-packages/zarr/core.py in _process_for_setitem(self, ckey, chunk_selection, value, fields)
   1950 
   1951         # encode chunk
-> 1952         return self._encode_chunk(chunk)
   1953 
   1954     def _chunk_key(self, chunk_coords):

~/.local/lib/python3.8/site-packages/zarr/core.py in _encode_chunk(self, chunk)
   1999         if self._filters:
   2000             for f in self._filters:
-> 2001                 chunk = f.encode(chunk)
   2002 
   2003         # check object encoding

~/.local/lib/python3.8/site-packages/numcodecs/json.py in encode(self, buf)
     59         items.append(buf.dtype.str)
     60         items.append(buf.shape)
---> 61         return self._encoder.encode(items).encode(self._text_encoding)
     62 
     63     def decode(self, buf, out=None):

~/anaconda3/lib/python3.8/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

~/anaconda3/lib/python3.8/json/encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

~/anaconda3/lib/python3.8/json/encoder.py in default(self, o)
    177 
    178         """
--> 179         raise TypeError(f'Object of type {o.__class__.__name__} '
    180                         f'is not JSON serializable')
    181 

TypeError: Object of type ndarray is not JSON serializable

hyanwong Jun 21, 2021
Collaborator

Hmm, that's a pain. Sorry. I wonder which array it's trying to dump into the SampleData file. Are there 1019076 variants? If so, can you peek at variants 771977, 771978 (where it claims to have failed), and 771979 and see if anything looks weird?

lakishadavid Jun 21, 2021
Author

Yes, that's the correct number of variants: 1,019,076 (which is consistent with the count from Beagle). I'll check the possible problem sites in the vcf file and will probably just go ahead and remove them, keeping a log of the decisions. I'll post another update.

Output from Beagle 5.2

Cumulative Statistics:

Study     markers:            1,019,076

Haplotype phasing time:        22 minutes 48 seconds
Total time:                    26 minutes 45 seconds

End time: 06:50 PM EDT on 20 Jun 2021
beagle.29May21.d6d.jar finished

jeromekelleher Jun 21, 2021
Maintainer

Looks like there's a metadata issue @lakishadavid - maybe try doing the import without metadata first, and see how that goes?

hyanwong · 2021-06-21T09:01:42Z

Collaborator

Since it seems to be taking a long time to read in your data, @lakishadavid, note that there is some example code that reads a VCF file in parallel. I haven't adjusted it to use a FASTA file for the ancestral state, but I guess that's possible, since samtools makes a random-access index for the FASTA. Here's the link to the code:

#277 (comment)
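
As a rough illustration of the random-access idea, a single ancestral base can be fetched by position instead of loading a whole chromosome into memory. This sketch is not taken from the linked example: the helper names are invented, and it assumes pysam's FastaFile.fetch, which takes 0-based half-open coordinates, together with an index (.fai) for the FASTA file.

```python
def one_based_to_interval(pos):
    """Convert a 1-based VCF position to a 0-based half-open (start, end)."""
    if pos < 1:
        raise ValueError("VCF positions are 1-based, so pos must be >= 1")
    return pos - 1, pos


def fetch_ancestral_base(fasta, record_name, pos):
    """Return the base at 1-based position pos, or None if it is not A/C/G/T.

    fasta is expected to be an open pysam.FastaFile, or any object with a
    compatible fetch(reference, start, end) method.
    """
    start, end = one_based_to_interval(pos)
    base = fasta.fetch(record_name, start, end)
    # Only uppercase A, C, G and T are accepted, as in the script above.
    return base if base in ("A", "C", "G", "T") else None
```

Fetching one base per call is likely to be slower than indexing an in-memory string for a single chromosome, so this is mainly of interest when memory is tight or when parallel workers each handle their own region.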

1 reply

lakishadavid Jun 23, 2021
Author

@hyanwong, I revised the code to use cyvcf2 (instead of converting from VCF to zarr) and the performance is much better!

76%|███████▌ | 771978/1019076 [06:18<02:01, 2039.10it/s].

Wow! Thanks for pointing that out. I'll also check the parallel processing.
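
For readers wondering what a cyvcf2 version might look like, here is a minimal sketch. The revised script itself was not posted, so this is only a guess at its shape. It assumes cyvcf2's VCF iterator and the Variant attributes POS, REF, ALT and genotypes (one [allele, allele, phased] list per diploid sample); the function names are made up.

```python
def flatten_genotypes(genotypes):
    """Turn cyvcf2-style [[a0, a1, phased], ...] into a flat haplotype list."""
    flat = []
    for call in genotypes:
        flat.extend(call[:-1])  # drop the trailing phased flag
    return flat


def iter_biallelic_sites(vcf_path):
    """Yield (position, ref, alt, haplotype_calls) for biallelic records."""
    from cyvcf2 import VCF  # imported lazily; only needed when reading a file

    for variant in VCF(vcf_path):
        if len(variant.ALT) != 1:
            continue  # skip multiallelic records and records with no ALT
        yield (
            variant.POS,
            variant.REF,
            variant.ALT[0],
            flatten_genotypes(variant.genotypes),
        )
```

Streaming records this way avoids the intermediate zarr conversion and the per-site array lookups of the earlier script.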

lakishadavid · 2021-06-22T00:55:29Z

Author

Thanks, @hyanwong and @jeromekelleher. The above code works after revising it to skip samples.add_site when the second allele entry was blank.

At first, I reduced my data vcf file to the header and chromosome 1 variants. Then I completed a few rounds of running the script (unsuccessfully) and removing variants indicated by the error output ending in TypeError: Object of type ndarray is not JSON serializable. After searching online some more to better understand what causes this error, I revised the script by adding alleles = alleles.tolist() before samples.add_site and set alleles = alleles within samples.add_site. After running the script this time, I noticed that sometimes alleles[1] was empty which I'm guessing is the reason for the error output message. So, then, finally I added if alleles[1] != "" and alleles[1] != "N": before samples.add_site and the script ran and ended without an error message. The output to the last print command was:
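
The two fixes described here (passing a plain list instead of a NumPy array, and skipping sites whose second allele is blank or N) can be combined in one small helper. This is a sketch of that logic, not the script that was actually run; the name clean_alleles is hypothetical.

```python
def clean_alleles(allele1, allele2):
    """Return [allele1, allele2] as plain Python strings, or None to skip the site.

    The site is skipped when the second allele is empty or "N".
    """
    # str() also converts NumPy string scalars, so the resulting list
    # contains only built-in types that the json module can serialize.
    alleles = [str(allele1), str(allele2)]
    if alleles[1] in ("", "N"):
        return None
    return alleles


# Possible use inside the site loop (names taken from the script above):
#     alleles = clean_alleles(allele1, allele2)
#     if alleles is not None:
#         samples.add_site(position=POS, genotypes=genotypes, alleles=alleles)
```

Returning None for a skipped site keeps the decision in one place, which also makes it easy to log which sites were dropped.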

Sample file created for 120 samples (60 individuals) with 59776 variable sites.

I was then able to run the tsinfer.infer(samples) command and view tables.

I can now take what I've learned, plus additional readings of the documentation and examples, to refine my script. It was key to see @jeromekelleher's code on getting all the ancestral states because I'm sure I would have missed some important details explained in the commented lines. Ordering the genotypes based on this information was an unexpected challenge that I didn't recognize until after I was able to add the alleles.

1 reply

hyanwong Nov 1, 2022
Collaborator

FYI, the new tsinfer 0.3 doesn't require allele ordering to be based on the ancestral state. However, we are likely to release tsinfer 0.4 in the relatively near future, and this may be more flexible still.

hyanwong · 2024-08-15T19:16:51Z

Collaborator

Note that specifying the ancestral state in the new VariantData interface (alpha version) is now easy, so I'm closing this.

0 replies
