abyzovlab/vcf2diploid

A more generalized version of this tool has recently been developed. Please check it out at https://github.com/doron-st/vcf2diploid

README file for vcf2diploid tool distribution

1. Compilation
==============

$ make


2. Running
==========

java -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...]

where sample_id is the ID of the individual whose genome is being constructed
(e.g., NA12878), file.fa is one or more FASTA files with reference sequences,
and file.vcf is one or more VCF 4.0 files with variants. Multiple FASTA and VCF
files can be specified at once. Splitting the whole genome into multiple files
(e.g., one FASTA file per chromosome) reduces memory usage.
The amount of memory available to Java can be increased as follows:

java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa ... [-vcf file.vcf ...]

You can try the program by running the 'test_run.sh' script in the 'example'
directory. See also "Important notes" below.
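
As a rough illustration of the usage line above, the following Python sketch
assembles the command for several per-chromosome files. Only the -id, -chr,
-vcf and -Xmx options come from this README; the helper name build_command and
the file names chr1.fa, chr2.fa, chr1.vcf and chr2.vcf are hypothetical.

```python
import shlex


def build_command(sample_id, fasta_files, vcf_files, max_heap_mb=None):
    """Assemble a vcf2diploid command line as a list of arguments.

    Hypothetical helper: it only mirrors the usage line in this README.
    """
    cmd = ["java"]
    if max_heap_mb is not None:
        # Optional Java heap limit, e.g. -Xmx4000m
        cmd.append(f"-Xmx{max_heap_mb}m")
    cmd += ["-jar", "vcf2diploid.jar", "-id", sample_id, "-chr", *fasta_files]
    if vcf_files:
        # The -vcf part is optional in the usage line
        cmd += ["-vcf", *vcf_files]
    return cmd


# Hypothetical per-chromosome input files
cmd = build_command(
    "NA12878",
    ["chr1.fa", "chr2.fa"],
    ["chr1.vcf", "chr2.vcf"],
    max_heap_mb=4000,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once
vcf2diploid.jar has been built with make.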


3. Constructing personal annotation and splice-junction library
===============================================================

* Using the chain file, one can lift over the annotation of the reference
genome to the personal haplotypes. This can be done with the liftOver tool
(see http://hgdownload.cse.ucsc.edu/admin/exe).

For example, to lift over the Gencode annotation one can run

$ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt


* To construct personal splice-junction libraries for RNA-seq analysis, one can
use RSEQtools (http://archive.gersteinlab.org/proj/rnaseq/rseqtools).


Important notes
===============

All characters between '>' and the first whitespace character in a FASTA header
are used internally as the chromosome/sequence name. For instance, for the header

>chr1 human

vcf2diploid will load the corresponding sequence into memory under the
name 'chr1'.
Chromosome/sequence names should be consistent between FASTA and VCF files, but
omission of 'chr' at the beginning is allowed, i.e., 'chr1' and '1' are treated
as the same name.
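
The naming rules above can be sketched in Python as follows. This is an
illustration of the described behavior, not the tool's actual code: the
function names are hypothetical, and treating the 'chr' prefix as
case-sensitive is an assumption.

```python
def sequence_name(header):
    """Return the characters between '>' and the first whitespace."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: " + header)
    fields = header[1:].split(None, 1)
    return fields[0] if fields else ""


def normalize(name):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal.

    Assumption: the prefix is matched case-sensitively.
    """
    return name[3:] if name.startswith("chr") else name


def same_sequence(fasta_name, vcf_name):
    """Check whether a FASTA name and a VCF name refer to the same sequence."""
    return normalize(fasta_name) == normalize(vcf_name)


name = sequence_name(">chr1 human")
print(name)                      # chr1
print(same_sequence(name, "1"))  # True
print(same_sequence(name, "2"))  # False
```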

The output contains (file formats are described below):
1) FASTA files with sequences for each haplotype.
2) CHAIN files relating the paternal and maternal haplotypes to the reference genome.
3) MAP files with base correspondence between paternal-maternal-reference
sequences.

File formats:
* FASTA -- see http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
* CHAIN -- see http://genome.ucsc.edu/goldenPath/help/chain.html
* MAP -- each block of bases that are equivalent in all three haplotypes
(paternal, maternal and reference) is represented by one record holding the
index of the block's first base in each haplotype. Blocks of non-equivalent
bases are represented as separate records, with '0' for each haplotype that
has no equivalent base (see the illustration below).

Pat Mat Ref           MAP format
X   X   X    ____
X   X   X        \
X   X   X         --> P1 M1 R1
X   X   -    -------> P4 M4  0
X   X   -       ,--->  0 M6 R4
-   X   X    --'  ,-> P6 M7 R5
X   X   X    ----' 
X   X   X     
X   X   X     
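
To make the illustration concrete, the Python sketch below takes the four
records from the diagram and recovers the length of each block. It assumes, as
the diagram suggests, that indices are 1-based and that every base of a
haplotype belongs to exactly one record; the function name block_lengths is
hypothetical, and the on-disk layout of MAP files is not assumed here.

```python
def block_lengths(records):
    """Return the length of each block, or None where it cannot be derived.

    Each record is a (pat, mat, ref) tuple of start indices, with 0 marking
    a haplotype that has no equivalent base in that block.
    """
    lengths = []
    for i, rec in enumerate(records):
        length = None
        for h in range(3):
            if rec[h] == 0:
                continue  # haplotype absent from this block
            # The block ends where the same haplotype next starts a record.
            for later in records[i + 1:]:
                if later[h] != 0:
                    length = later[h] - rec[h]
                    break
            if length is not None:
                break
        lengths.append(length)
    return lengths


# Records from the diagram: P1 M1 R1, P4 M4 0, 0 M6 R4, P6 M7 R5
records = [(1, 1, 1), (4, 4, 0), (0, 6, 4), (6, 7, 5)]
print(block_lengths(records))  # [3, 2, 1, None]
```

The last block has no following record, so its length would have to come from
the sequence lengths, which are not part of the records themselves.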




For questions and comments, contact Alexej Abyzov ([email protected]) and
Mark Gerstein ([email protected])