A simple pipeline to process the VCF deconstructed from the VG pangenome graph
- Remove large variants with `vcfbub -l 0 -r 10000000`;
- Trim ALT alleles not seen in the genotype fields with `bcftools view -a`;
- Group the alternative alleles at multiallelic SV loci by their length (see `allele-group.pdf`);
- Split multiallelic records with `bcftools norm -m -any`;
- Classify the variants into small variants and SVs (abs(length(ALT_allele)-length(REF_allele))>=50); a rough shell equivalent of these steps is sketched below.
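For orientation only, here is a minimal sketch of what the steps above look like as manual shell commands. The file names (`graph.deconstructed.vcf.gz`, `out.*`) are placeholders, and the allele-grouping step is only noted in a comment because it is implemented inside the pipeline itself; the Snakemake workflow below remains the supported way to run the processing.

```bash
# Sketch of the pipeline steps as manual commands (file names are placeholders)

# 1. remove large variants (bubbles with reference length > 10 Mb)
vcfbub -l 0 -r 10000000 --input graph.deconstructed.vcf.gz > out.bub.vcf

# 2. trim ALT alleles not seen in the genotype fields
bcftools view -a out.bub.vcf -Oz -o out.trim.vcf.gz

# 3. allele grouping at multiallelic SV loci is done by the pipeline's own code
#    (see allele-group.pdf); no single bcftools command reproduces it

# 4. split multiallelic records into biallelic ones
bcftools norm -m -any out.trim.vcf.gz -Oz -o out.biallelic.vcf.gz

# 5. classify by allele-length difference: SVs vs. small variants
bcftools view -i 'ABS(STRLEN(ALT)-STRLEN(REF))>=50' out.biallelic.vcf.gz -Oz -o out.SVs.vcf.gz
bcftools view -e 'ABS(STRLEN(ALT)-STRLEN(REF))>=50' out.biallelic.vcf.gz -Oz -o out.small_variants.vcf.gz
```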
conda or mamba is required
```bash
# clone the repo
git clone https://github.com/Shuhua-Group/PanGenome_VCF_PostProcess.git

# create the environment
cd PanGenome_VCF_PostProcess
mamba env create -f environment.yaml

# activate the environment
mamba activate PanVCF
```
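Optionally, run a quick sanity check that the required tools are on the PATH after activation (the tool list here is an assumption based on the pipeline steps above; `environment.yaml` is authoritative):

```bash
# confirm the tools the pipeline relies on are available
# (tool list assumed from the pipeline steps; environment.yaml is authoritative)
for tool in snakemake bcftools vcfbub; do
    command -v "$tool" >/dev/null || echo "missing: $tool"
done
```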
Add the working directory `work_dir` and the pangenome VCF file `pan_vcf` to `config.yaml`:
```yaml
# where you want the pipeline to work
work_dir: ${wd}
# the vcf file deconstructed from the vg pangenome
pan_vcf: ${PanGenome_VCF}
# prefix of output files
out_prefix: CPC.HPRC.Phase1.processed
# allele grouping threads
threads: 16
```
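A filled-in `config.yaml` might look like the following (the paths are hypothetical; substitute your own):

```yaml
# example configuration with hypothetical paths
work_dir: /data/pangenome/postprocess
pan_vcf: /data/pangenome/CPC.HPRC.Phase1.deconstructed.vcf.gz
out_prefix: CPC.HPRC.Phase1.processed
threads: 16
```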
Make sure that `PanGenome_vcf.post_process.smk`, `config.yaml`, and `module.py` are under the same directory, then run:
```bash
snakemake -c 128 -s PanGenome_vcf.post_process.smk
```
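Before committing 128 cores, a dry run can confirm that the workflow resolves correctly; `-n` is standard Snakemake, and `-c` can be lowered to match your machine:

```bash
# dry run: print the jobs that would be executed without running them
snakemake -n -s PanGenome_vcf.post_process.smk

# run with fewer cores on a smaller machine
snakemake -c 16 -s PanGenome_vcf.post_process.smk
```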
Processing the CPC+HPRC pangenome took about two hours on a 128-core server, and the processed VCF files are:
```
# biallelic small variants (abs(length(ALT_allele)-length(REF_allele))<50)
CPC.HPRC.Phase1.processed.small_variants.normed.vcf.gz
# biallelic SVs (abs(length(ALT_allele)-length(REF_allele))>=50)
CPC.HPRC.Phase1.processed.SVs.normed.vcf.gz
```
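To index and take a quick look at the outputs, standard bcftools commands work; note the indexing step is an assumption, as the pipeline may already emit `.tbi` files:

```bash
# index the outputs (may be unnecessary if the pipeline already writes .tbi files)
bcftools index -t CPC.HPRC.Phase1.processed.SVs.normed.vcf.gz
bcftools index -t CPC.HPRC.Phase1.processed.small_variants.normed.vcf.gz

# summary statistics for a quick sanity check
bcftools stats CPC.HPRC.Phase1.processed.SVs.normed.vcf.gz | less
```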
[Note] We have updated the processing strategy for complex loci that contain both small-variant and SV alleles, so the number of variants differs slightly from that reported in the paper.