Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check sgkit documentation how to import a VCF file & how we will move towards using sgkit in this tree seq pipeline #55

Open
gregorgorjanc opened this issue Nov 17, 2023 · 3 comments
Assignees

Comments

@gregorgorjanc
Copy link
Member

There are two parts here.

Part 1: Check sgkit documentation how to import a VCF file

This is sgkit:

We need to learn how to use sgkit, so best to check how to import a VCF file into Python and create an sgkit object - how we do this...

Part 2: How we will move towards using sgkit in this tree seq pipeline

Once we master the above part, we should be looking into how to do what Ben Jeffrey is doing at https://github.com/benjeffery/tsinfer-snakemake/tree/main

@AprilYUZhang
Copy link

The script that we would like to change eventually is

https://github.com/HighlanderLab/tree_seq_pipeline/blob/main/Snakemake/workflow/scripts/Prepare_tsinfer_sample_file.py

to what Ben Jeffrey and the team uses based on sgkit

@AprilYUZhang
Copy link

Using sgkit is very simple and nicely explained at https://pystatgen.github.io/sgkit/latest/vcf.html#download-the-data - there they download some public VCF files and show how to convert these to ZARR files which sgkit works with.

@AprilYUZhang
Copy link

Looking at https://github.com/benjeffery/tsinfer-snakemake/blob/main/steps.py, we see the code like the one below, where tsinfer.SgkitSampleData(input[0].replace("variant_mask", "")) is called, where we guess the input is either a ZARR file and the tsinfer.SgkitSampleData() might be doing all the magic.

HOWEVER, the pipeline at https://github.com/benjeffery/tsinfer-snakemake is quite involved so we bette talk with Ben to discuss how to take this onwards

def generate_ancestors(input, output, wildcards, config, threads):  # noqa: A002
    import tsinfer
    import logging
    import os
    from pathlib import Path

    logging.basicConfig(level=logging.INFO)
    data_dir = Path(config["data_dir"])
    sample_data = tsinfer.SgkitSampleData(input[0].replace("variant_mask", ""))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants