Fasta to SampleData #674

reedacartwright · 2022-06-04T02:32:26Z

I have a FASTA file with ~1000 phased sequences, encoded as biallelic characters (0s and 1s) derived from SNPs and indels. What's the best way to turn this into a SampleData object for use with the tsinfer CLI? Note: I'm not a Python programmer.

Example:

>Seq_1
00000
>Seq_2
01000
>Seq_3
01100

Answered by hyanwong

jeromekelleher (Maintainer) · 2022-06-06T09:34:09Z

The simplest thing to do would be to convert to a VCF in some way, and then follow the standard methods outlined in the tutorial.
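As a rough illustration of the "convert to a VCF in some way" step, here is a minimal sketch using only the Python standard library. It is not taken from the tsinfer documentation. The function names (`read_binary_fasta`, `binary_fasta_to_vcf`), the contig name `"1"`, the 1-based site index used as POS, and the placeholder alleles `A`/`T` are all assumptions made for illustration, since the 0/1 FASTA carries no real coordinates or alleles. Each sequence is written as one haploid sample column; whether the tutorial's VCF-reading code accepts haploid columns unchanged is not checked here.

```python
import io


def read_binary_fasta(handle):
    """Return (names, sequences) from a FASTA handle, joining wrapped lines."""
    names, seqs = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            names.append(line[1:].strip())
            seqs.append("")
        else:
            if not seqs:
                raise ValueError("sequence data found before the first '>' header")
            seqs[-1] += line
    return names, seqs


def binary_fasta_to_vcf(handle, contig="1"):
    """Build minimal VCF text with one haploid sample per FASTA sequence.

    POS is simply the 1-based column index and REF/ALT are placeholders
    ("A"/"T"): both are assumptions, not real coordinates or alleles.
    """
    names, seqs = read_binary_fasta(handle)
    if not seqs:
        raise ValueError("no sequences found")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences differ in length")
    if any(set(s) - {"0", "1"} for s in seqs):
        raise ValueError("only the characters 0 and 1 are supported")

    fixed_columns = ["#CHROM", "POS", "ID", "REF", "ALT",
                     "QUAL", "FILTER", "INFO", "FORMAT"]
    lines = [
        "##fileformat=VCFv4.2",
        f"##contig=<ID={contig},length={length}>",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(fixed_columns + names),
    ]
    for i in range(length):
        genotypes = [s[i] for s in seqs]  # one haploid call per sequence
        record = [contig, str(i + 1), ".", "A", "T", ".", "PASS", ".", "GT"]
        lines.append("\t".join(record + genotypes))
    return "\n".join(lines) + "\n"


# Usage with the three example sequences from the question
example = io.StringIO(">Seq_1\n00000\n>Seq_2\n01000\n>Seq_3\n01100\n")
print(binary_fasta_to_vcf(example))
```

If real site positions and alleles are available from the original SNP/indel calls, they should replace the placeholders before the VCF is used for anything beyond inference on the 0/1 matrix.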

hyanwong (Collaborator) · 2022-07-04T11:26:31Z

Alternatively, if you just want to read it in as a massive matrix:

import numpy as np
import tsinfer
# Read the whole file into one big matrix (rows = sequences, columns = sites).
# This assumes each sequence sits on a single line; if sequences are wrapped,
# first use a text editor to delete all newlines except those followed by ">".
binary_data = np.genfromtxt(
    "tmp.fasta",
    comments=">",  # ignore any lines starting with ">"
    delimiter=1,  # fixed-width fields: one character per value
    dtype=int,
)

with tsinfer.SampleData(
    path="my_data.samples",
    sequence_length=binary_data.shape[1]
) as sd:
    for pos, column in enumerate(binary_data.T):  # iterate over transposed matrix
        sd.add_site(pos, column)
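The code above assumes each sequence occupies a single line. If the FASTA is line-wrapped, a small standard-library reader can join the wrapped lines instead of editing the file by hand. The following is only a sketch under that assumption; `site_columns` is a hypothetical helper name, not part of tsinfer or NumPy.

```python
import io


def site_columns(handle):
    """Yield (position, genotypes) per site from a 0/1 FASTA.

    Lines belonging to the same sequence are joined, so wrapped FASTA
    files work without manual editing.
    """
    chunks = []  # one list of line fragments per sequence
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            chunks.append([])
        elif line:
            if not chunks:
                raise ValueError("sequence data found before the first '>' header")
            chunks[-1].append(line)

    rows = ["".join(parts) for parts in chunks]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("sequences differ in length")
    if set("".join(rows)) - {"0", "1"}:
        raise ValueError("only the characters 0 and 1 are supported")

    # zip(*rows) transposes the data: one tuple of characters per site
    for pos, column in enumerate(zip(*rows)):
        yield pos, [int(c) for c in column]


# Usage: the example sequences from the question, wrapped across lines
wrapped = io.StringIO(">Seq_1\n000\n00\n>Seq_2\n01\n000\n>Seq_3\n0110\n0\n")
for pos, column in site_columns(wrapped):
    print(pos, column)
```

Each yielded pair has the same shape as the `pos, column` values in the loop above, so it should be usable in the same `sd.add_site(pos, column)` call, with `sequence_length` set to the number of sites. That combination is untested here.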

reedacartwright (Author) · Jul 19, 2022

Thank you for the code. I'll try it out and let you know if it worked for me.
