Updating docs, fixing bugs

abyzovlab · Feb 27, 2020 · 54f9fe9 · 54f9fe9
1 parent 1f50ad7
commit 54f9fe9
Show file tree

Hide file tree

Showing 3 changed files with 208 additions and 34 deletions.
diff --git a/GettingStarted.md b/GettingStarted.md
@@ -1,6 +1,8 @@
 # Getting Started with CNVpytor
 
-## Installing from GitHub
+## Install
+
+### Cloning GitHub repository
 
 ```
 > git clone https://github.com/abyzovlab/CNVpytor.git
@@ -12,73 +14,239 @@ For single user (without admin privileges) use:
 > python setup.py install --user
 ```
 
-## Import read depth signal
+### Using pip
 
+```
+> pip install cnvpytor
+> cnvpytor -download
+```
 
-Make sure that you have indexed SAM, BAM or CRAM file.
+Second line will download resource files from GitHub.
+
+## Import read depth signal
+
+Make sure that you have indexed alignment (SAM, BAM or CRAM) file.
 
 Initialize your CNVpytor project by running:
+```
+> cnvpytor -root file.pytor -rd file.bam [-chrom name1 ...] [-T ref.fa.gz]
+``` 
+where:
+
+* file.pytor -- specifies output cnvpytor file (HDF5 file)
+* name1 ... -- specifies chromosome name(s).
+* file.bam -- specifies bam/sam/cram file name.
+* -T ref.fa.gz -- specifies reference genome file (only for cram file without reference genome).
+
+
+Chromosome names must be specified the same way as they are described in the sam/bam/cram header, 
+e.g., chrX or X. One can specify multiple chromosomes separated by space. If no chromosome is specified, 
+read mapping is extracted for all chromosomes in the sam/bam file. 
+Note that the pytor file is not being overwritten.
+
+**Examples:**
+
+```
+> cnvpytor -root NA12878.pytor -chrom 1 2 3 -rd NA12878_ali.bam
+```
+for bam files with a header like this:
+```
+@HD VN:1.4 GO:none SO:coordinate
+@SQ SN:1 LN:249250621
+@SQ SN:2 LN:243199373
+@SQ SN:3 LN:198022430
+```
 
+or
+```
+> cnvpytor -root NA12878.pytor -chrom chr1 chr2 chr3 -rd NA12878_ali.bam
+```
+for bam files with a header like this:
 ```
-> cnvpytor -root file.pytor -rd file.bam
+@HD VN:1.4 GO:none SO:coordinate
+@SQ SN:chr1 LN:249250621
+@SQ SN:chr2 LN:243199373
+@SQ SN:chr3 LN:198022430
 ```
 
-File file.pytor will be created and read depth data binned to 100 base pair bins will be stored 
+
+After -rd step file file.pytor is created and read depth data binned to 100 base pair bins will be stored 
 in _pytor_ file.
 
-CNVpytor will detect reference genome and use internal database for GC content and 1000 genome strict mask.
+Chromosome names and lengths are parsed from the input file header and used to 
+detect reference genome. 
 
-After instalation this works for hg19 and hg38 genomes. For other species or reference genomes you have to 
-[describe reference genome](examples/AddReferenceGenome.md).
+Reference genome is important for for GC correction and 1000 genome strict mask filtering.
+CNVpytor installation includes human genomes hg19 and hg38 with data files that provide information
+about GC content and strict mask. For other species or reference genomes you have to 
+configure reference genome (see 
+[example](examples/AddReferenceGenome.md)).
 
 To check is reference genome detected use:
 
 ```
 > cnvpytor -root file.pytor -ls
 ```
+
 CNVpytor will print out details about file.pytor including line that specify which reference genome is
-used and are there available GC and mask data.
+used and are there available GC and mask data:
 ```
 Using reference genome: hg19 [ GC: yes, mask: yes ]
 ```
 
-## Calling CNV/CNA events
+Command -ls is useful if you want to check content of _pytor_ file but also date and version of cnvpytor 
+that created it.
 
+## Predicting CNV regions
 
+First we have to chose bin size. By cnvpytor design it have to be divisible by 100. 
+Here we will use 10 kbp and 100 kbp bins.
 
-
-First hose bin size. It has to be divisible by 100. Here we will use 10 kbp and 100 kbp bins.
-
-To calculate binned, GC corrected RD signal type:
+To calculate a read depth histograms, GC correction and statistics type:
 ```
 > cnvpytor -root file.pytor -his 10000 100000
 ```
-CNVpytor stores 
-
 
+Next step is partitioning using mean-shift method:
 ```
 > cnvpytor -root file.pytor -partition 10000 100000
 ```
 
+Finally we can call CNV regions using commands:
 ```
 > cnvpytor -root file.pytor -call 10000 > calls.10000.tsv
 > cnvpytor -root file.pytor -call 100000 > calls.100000.tsv
 ```
 
+Result is stored in tab separated files with following columns:
+* CNV type: "deletion" or "duplication",
+* chromosome name,
+* start position,
+* end position,
+* CNV size,
+* CNV level - read depth normalized to 1,
+* e-val1 -- p value calculated using t-test statistics between RD statistics in the region and global,
+* e-val2 -- p value from the probability of RD values within the region to be in the tails of a gaussian distribution of binned RD,
+* e-val3 -- same as e-val1 but for the middle of CNV,
+* e-val4 -- same as e-val2 but for the middle of CNV,
+* q0 -- fraction of reads mapped with q0 quality.
+
 
 ## Import SNP data
 
-### From single variant file
+### From variant file
+
+To import variant data from VCF file use following command:
+```
+> cnvpytor -root file.root -snp file.vcf.gz [-chrom name1 ...] [-ad AD_TAG] [-gt GT_TAG] [-noAD]
+```
+where:
+
+* file.pytor -- specifies cnvpytor file,
+* name1 ... -- specifies chromosome name(s),
+* file.vcf -- specifies variant file name.
+* -ad AD_TAG -- specifies AD tag used in vcf file (default AD)
+* -gt GT_TAG -- specifies GT tag used in vcf file (default GT)
+* -noAD -- ref and alt read counts will not be readed (see next section)
+
+Chromosome names must be specified the same way as they are described in the vcf header, 
+e.g., chrX or X. One can specify multiple chromosomes separated by space. If no chromosome is specified, 
+read mapping is extracted for all chromosomes in the vcf file.
+
+If chromosome names in variant and alignment file are different in prefix chr (e.g. in "1" and "chr1")
+cnvpytor will detect it and match the names using first imported name for both signals.
 
 ### Using SNP positions from variant file and counts from alignment file
 
-## Calculating RD histograms
+In some cases it is useful to read positions of SNPs from vcf file and 
+extract read counts from bam file. For example if we have two samples, normal tissue and 
+cancer, normal can be used to call germline SNPs, while samtools mpileup procedure can be used 
+to calculate read counts in cancer sample at the positions of SNPs. CNVpytor have
+implemented this procedure. After reading SNP positions (previous step) type:
 
-## Calculating BAF histograms
+```
+> cnvpytor -root file.pytor -pileup file.bam [-T ref.fa.gz]
+```
+where
+* file.pytor -- specifies cnvpytor file,
+* file.bam -- specifies bam/sam/cram file,
+* -T ref.fa.gz -- specifies reference genome file (only for cram file without reference genome).
+
+
+### Calculating BAF histograms
+
+```
+> cnvpytor -root file.pytor -baf 10000 100000 [-nomask] [-useid] [-usephase]
+```
+
+## Genotyping genomic regions
+
+Using -genotype option followed by bin_sizes you can enter region and genotype calculation
+for each bin size will be performed:
+
+```
+> cnvpytor -root file.pytor -genotype 10000 100000
+12:11396601-11436500
+12:11396601-11436500	1.933261	1.937531
+22:20999401-21300400
+22:20999401-21300400	1.949186	1.957068
+```
+
+**Example:**
+
+Genotype all called CNVs:
+
+```
+> awk '{ print $2 } END { print "exit" }' calls.10000.tsv | cnvpytor -root file.pytor -genotype 10000 100000
+```
+
+## Visualization
+
+### Plot from command line
+
+Chromosome wide plots:
+```
+> cnvpytor -root file.pytor -plot [rd BIN_SIZE] [likelihood BIN_SIZE] [baf BIN_SIZE] [snp] [-o IMAGE_FILENAME]
+```
+where
+* rd BIN_SIZE -- plots RD signal for all chromosomes
+* likelihood BIN_SIZE -- plots baf likelihood for all chromosomes
+* baf BIN_SIZE -- plots baf/maf/likelihood peak position for all chromosomes
+* snp -- plots baf for each snp for all chromosomes
+* -o IMAGE_FILENAME -- if specified, saves plot in file instead to show on the screen
+
+Manhattan plot:
+```
+> cnvpytor -root file.pytor -plot manhattan BIN_SIZE [-chrom name1 ...] [-o IMAGE_FILENAME]
+```
+
+Circular plot:
+```
+> cnvpytor -root file.pytor -plot circular BIN_SIZE [-chrom name1 ...] [-o IMAGE_FILENAME]
+```
+
+Plot genomic regions:
+```
+> cnvpytor -root file.pytor -plot regions [reg1[,| ]...] BIN_SIZE [-panels [rd] [likelihood] [baf] [snp] ...] [-o IMAGE_FILENAME]
+```
+where
+* reg1 -- comma or space separated regions in form CHR[:START-STOP], e.g. 1:1M-20M 2 3:200k-80000010
+* if regions are comma separated they will be plotted in the same subplot
+* space will split regions in different subplots
+* -panels -- specify which panels to plot: rd likelihood baf snp
+* -o IMAGE_FILENAME -- if specified, saves plot in file instead to show on the screen
+
+
+### Plot from interactive mode
+
+The best way to visualize cnvpytor results is interactive mode. Enter interactive mode by typing:
+```
+cnvpytor -root file.pytor -view BIN_SIZE
+```
 
-## Plotting
+There is tab completion and help similar to man pages. Type double tab or help to start.
 
-## Visualize CNVpytor data inside [JBrowse](https://github.com/GMOD/jbrowse)
+### Visualize CNVpytor data inside [JBrowse](https://github.com/GMOD/jbrowse)
 #### JBrowse version and plugins
 JBrowse version: https://github.com/GMOD/jbrowse/archive/1.16.6-release.tar.gz
 
@@ -139,4 +307,3 @@ For SNP data, it exports Binned BAF, Likelihood of the binned BAF signals. These
 
 cnvpytor does the segmentation for all of the above data based on the user provided bin size. The multiscalebigwig provides the option to show the data based on the visualized area on the reference genome, which means if a user visualizes a small region of the genome it shows small bin data and vice versa.           
 
-## Other
diff --git a/cnvpytor/root.py b/cnvpytor/root.py
@@ -12,6 +12,7 @@ class Root: main CNVpytor class
 from .viewer import anim_plot_likelihood, anim_plot_rd
 import numpy as np
 import logging
+import os.path
 
 _logger = logging.getLogger("cnvpytor.root")
 
@@ -233,6 +234,8 @@ def _read_vcf(self, vcf_file, chroms, sample='', use_index=False, no_counts=Fals
         vcff = Vcf(vcf_file)
         chrs = [c for c in vcff.get_chromosomes() if len(chroms) == 0 or c in chroms]
 
+        use_index = use_index and os.path.exists(vcf_file + ".tbi")
+
         def save_data(chr, pos, ref, alt, nref, nalt, gt, flag, qual):
             if (len(chroms) == 0 or chr in chroms) and (not pos is None) and (len(pos) > 0):
                 self.io.save_snp(chr, pos, ref, alt, nref, nalt, gt, flag, qual, callset=callset,
@@ -293,7 +296,8 @@ def rd(self, bamfiles, chroms=[], reference_filename=False):
                 self.io_gc = IO(Genome.reference_genomes[rg_name]["gc_file"])
                 self.rd_stat()
 
-    def vcf(self, vcf_files, chroms=[], sample='', no_counts=False, ad_tag="AD", gt_tag="GT", callset=None, use_index=False):
+    def vcf(self, vcf_files, chroms=[], sample='', no_counts=False, ad_tag="AD", gt_tag="GT", callset=None,
+            use_index=True):
         """ Read SNP data from variant file(s) and store in .cnvpytor file
 
         Parameters
@@ -319,7 +323,7 @@ def vcf(self, vcf_files, chroms=[], sample='', no_counts=False, ad_tag="AD", gt_
         """
         for vcf_file in vcf_files:
             self._read_vcf(vcf_file, chroms, sample, no_counts=no_counts, ad_tag=ad_tag, gt_tag=gt_tag,
-                                 callset=callset, use_index=use_index)
+                           callset=callset, use_index=use_index)
             self.io.add_meta_attribute("VCF", vcf_file)
 
     def rd_from_vcf(self, vcf_file, chroms=[], sample='', ad_tag="AD", dp_tag="DP", use_index=False):
@@ -1046,7 +1050,8 @@ def calc_grad_par(ks):
 
                         self.io.create_signal(c, bin_size, "RD partition", levels, flags=flag_rd)
 
-    def call(self, bin_sizes, chroms=[], print_calls=False, use_gc_corr=True, use_mask=False, genome_size=2.9e9, genome_cnv_fraction=0.01):
+    def call(self, bin_sizes, chroms=[], print_calls=False, use_gc_corr=True, use_mask=False, genome_size=2.9e9,
+             genome_cnv_fraction=0.01):
         """
         CNV caller based on the segmented RD signal.
 
@@ -1090,7 +1095,7 @@ def call(self, bin_sizes, chroms=[], print_calls=False, use_gc_corr=True, use_ma
                 rd_mask_chromosomes[rd_name] = c
 
         for bin_size in bin_sizes:
-            ret[bin_size]=[]
+            ret[bin_size] = []
             for c in self.io.rd_chromosomes():
                 if (c in rd_gc_chromosomes or not use_gc_corr) and (c in rd_mask_chromosomes or not use_mask) and (
                         len(chroms) == 0 or (c in chroms)):
@@ -1856,7 +1861,7 @@ def call_2d(self, bin_sizes, chroms=[], use_gc_corr=True, rd_use_mask=False, snp
                                 if (segments[j][0] - segments[i][-1]) < max_distance * (
                                         len(segments[i]) + len(segments[j])) and \
                                         normal_overlap(level[i], error[i], level[j], error[j]) * likelihood_overlap(
-                                        likelihood[i], likelihood[j]) == maxo:
+                                    likelihood[i], likelihood[j]) == maxo:
                                     nl, ne = normal_merge(level[i], error[i], level[j], error[j])
                                     nlh = likelihood[i] * likelihood[j]
                                     level[i] = nl
@@ -1898,9 +1903,9 @@ def call_2d(self, bin_sizes, chroms=[], use_gc_corr=True, rd_use_mask=False, snp
                             rd_p = t_test_1_sample(mean, level[i], error[i], len(segments[i]))
                             baf_mean, baf_p = likelihood_baf_pval(likelihood[i])
                             print(c + ":" + str(segments[i][0] * bin_size + 1) + "-" + str(
-                                    segments[i][-1] * bin_size + bin_size),
-                                      (segments[i][-1] - segments[i][0] + 1) * bin_size, bin_size, len(segments[i]),
-                                      rd_mean, rd_p, baf_mean, baf_p)
+                                segments[i][-1] * bin_size + bin_size),
+                                  (segments[i][-1] - segments[i][0] + 1) * bin_size, bin_size, len(segments[i]),
+                                  rd_mean, rd_p, baf_mean, baf_p)
 
                         self.io.create_signal(c, bin_size, "RD mosaic segments",
                                               data=segments_code(segments), flags=flag_rd)

diff --git a/examples/AddReferenceGenome.md b/examples/AddReferenceGenome.md
@@ -4,7 +4,7 @@ For GC correction and 1000 genome strict mask filtering CNVpytor uses informatio
 related to the reference genome. With installation two reference genomes are
 available: hg19 (GRCh37) and hg28 (GRCh38).
 
-If you want to use other reference genome for human or other species first we have 
+If we want to use CNVpytor for other species or different reference genome for human first we have 
 to create GC and mask file (optional).
 
 In this example we will configure mouse reference genome MGSCv37.
@@ -34,7 +34,7 @@ apart from the filtering step.
 You may also generate your own mask file by creating fasta file that contains character "P" if corresponding 
 base pair passes the filter and any character different than "P" if not.
 
-Now, we will create example_ref_genome_conf.py file containing following:
+Now, we will create _example_ref_genome_conf.py_ file containing list of chromosomes and chromosome lengths:
 
 ```
 import_reference_genomes = {
@@ -56,7 +56,9 @@ import_reference_genomes = {
 }
 ```
 
-Last line can be skipped, if there is no mask file. 
+Last line can be skipped, if there is no mask file. Letter next to chromosome length denote type of a chromosome:
+A - autosome, S - sex chromosome, M - mitochondria.
+
 
 To use CNVpytor with new reference genome us -conf option in each cnvpytor command, e.g.
 ```
@@ -71,7 +73,7 @@ cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rg mm9
 ```
 
 To avoid typing "-conf REL_PATH/example_ref_genome_conf.py" each time you run cnvpytor, 
-you can create an alias. However, we would like to encourage you to send us configuration, 
+you can create an bash alias. However, we would like to encourage you to send us configuration, 
 gc and mask file and we would be glad to include it into the CNVpytor code. Or, even better, 
 fork the repository on GitHub, add configuration in cnvpytor/genome.py, data files in cnvpytor/data 
 and create a pull request.