Fasta to SampleData #674
Answered by hyanwong
reedacartwright asked this question in Q&A
I have a Fasta file with ~1000 phased sequences turned into biallelic characters (0's and 1's), from SNPs and indels. What's the best way to turn this into a SampleData object for use with the tsinfer CLI? Note: I'm not a Python programmer. Example:
Answered by hyanwong, Jul 4, 2022
The simplest thing to do would be to convert to a VCF in some way, and then follow the standard methods outlined in the tutorial.
Alternatively, if you just want to read it in as a massive matrix:

    import numpy as np
    import tsinfer

    # slurp it all into a big matrix: assumes all data for a seq is on one line
    # otherwise use a text editor to delete all newlines except those
    # followed by ">"
    binary_data = np.genfromtxt(
        "tmp.fasta",
        comments=">",  # ignore any lines starting with ">"
        delimiter=1,  # one char per value
        dtype=int,
    )
    with tsinfer.SampleData(
        path="my_data.samples",
        sequence_length=binary_data.shape[1],
    ) as sd:
        for pos, column in enumerate(binary_data.T):  # iterate over transposed matrix
            sd.add_site(pos, column)
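If the sequences are wrapped over several lines, a small parser could stand in for the text-editor step. What follows is only a minimal sketch and not part of the original answer: `read_binary_fasta`, `site_columns`, and `write_sample_data` are hypothetical helper names, and it assumes every sequence character is 0 or 1 and that all sequences have the same length. It calls `tsinfer.SampleData` and `add_site` in the same way as the snippet above, using the site index as the position.

```python
import io


def read_binary_fasta(handle):
    """Read FASTA text into (names, rows); each row is a list of 0/1 ints.

    Sequence lines belonging to one record are joined, so wrapped
    (multi-line) records are handled.
    """
    names, rows = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            names.append(line[1:])
            rows.append([])
            continue
        if not rows:
            raise ValueError("sequence data found before the first '>' header")
        for ch in line:
            if ch not in "01":
                raise ValueError(f"unexpected character {ch!r} in {names[-1]}")
            rows[-1].append(int(ch))
    if len({len(row) for row in rows}) > 1:
        raise ValueError("sequences do not all have the same length")
    return names, rows


def site_columns(rows):
    """Transpose sample rows into per-site columns (one genotype per sample)."""
    return [list(column) for column in zip(*rows)]


def write_sample_data(rows, path):
    """Store each site column in a tsinfer SampleData file (needs tsinfer)."""
    import tsinfer  # imported here so the parsing helpers work without it

    columns = site_columns(rows)
    with tsinfer.SampleData(path=path, sequence_length=len(columns)) as sd:
        for pos, column in enumerate(columns):
            sd.add_site(pos, column)


# Small in-memory demonstration; the first record is wrapped over two lines.
example = io.StringIO(">seq1\n01\n10\n>seq2\n1100\n")
names, rows = read_binary_fasta(example)
print(names, site_columns(rows))
```

For a real file, something like `with open("tmp.fasta") as f: names, rows = read_binary_fasta(f)` followed by `write_sample_data(rows, "my_data.samples")` should give the same kind of file as the matrix approach.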
Answer selected by reedacartwright