Updating docs, fixing bugs
suvakov committed Feb 27, 2020
1 parent 2531618 commit 1f50ad7
Showing 3 changed files with 83 additions and 4 deletions.
5 changes: 3 additions & 2 deletions GettingStarted.md
@@ -28,8 +28,8 @@ in _pytor_ file.

CNVpytor will detect reference genome and use internal database for GC content and 1000 genome strict mask.

-This works for hg19 and hg38 genomes. For other species or reference genomes you have to
-[specify reference genome](examples/AddReferenceGenome.md).
+After installation this works for hg19 and hg38 genomes. For other species or reference genomes you have to
+[describe reference genome](examples/AddReferenceGenome.md).

To check whether the reference genome was detected, use:

@@ -46,6 +46,7 @@ Using reference genome: hg19 [ GC: yes, mask: yes ]




First choose a bin size. It has to be divisible by 100. Here we will use 10 kbp and 100 kbp bins.

To calculate binned, GC corrected RD signal type:
3 changes: 1 addition & 2 deletions cnvpytor/genome.py
@@ -282,8 +282,7 @@ def load_reference_genomes(cls, filename):
         """
         _logger.info("Reading configuration file '%s'." % filename)
-        import_reference_genomes = {}
-        exec(open(filename).read())
+        exec(open(filename).read(),globals())
         for g in import_reference_genomes:
             _logger.info("Importing reference genome data: '%s'." % g)
             cls.reference_genomes[g] = import_reference_genomes[g]
79 changes: 79 additions & 0 deletions examples/AddReferenceGenome.md
@@ -0,0 +1,79 @@
# Configuring reference genome

For GC correction and 1000 Genomes strict mask filtering, CNVpytor uses information
related to the reference genome. Two reference genomes are available after
installation: hg19 (GRCh37) and hg38 (GRCh38).

If you want to use another reference genome, for human or for another species, you first
have to create a GC file and, optionally, a mask file.

In this example we will configure the mouse reference genome MGSCv37.

To create the GC file we need the sequence of the reference genome in a fasta.gz file:

```
> cnvpytor -root MGSCv37_gc_file.pytor -gc ~/hg19/mouse.fasta.gz -make_gc_file
```

This command will produce _MGSCv37_gc_file.pytor_ file that contains information about
GC content in 100-base-pair bins.
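
To make "GC content in 100-base-pair bins" concrete, here is a minimal sketch of the calculation on a plain sequence string. It is an illustration only: the function name `gc_content_per_bin` is hypothetical, and the way CNVpytor computes and stores these values inside the _pytor_ file is not described here, so nothing below should be read as its implementation.

```python
def gc_content_per_bin(sequence, bin_size=100):
    """Return the GC percentage of each consecutive bin of `sequence`.

    Assumption for this sketch: bases other than A, C, G, T (for example N)
    are ignored, and a bin with no countable bases yields None.
    """
    percentages = []
    for start in range(0, len(sequence), bin_size):
        window = sequence[start:start + bin_size].upper()
        gc = window.count("G") + window.count("C")
        at = window.count("A") + window.count("T")
        total = gc + at
        percentages.append(100.0 * gc / total if total else None)
    return percentages


if __name__ == "__main__":
    # Toy sequence: one 100 bp bin of alternating GCAT, then a short run of N.
    print(gc_content_per_bin("GCAT" * 25 + "NNNN"))
```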

For reference genomes where we have a strict mask in the same format as the 1000 Genomes Project
[strict mask](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20160622_genome_mask_GRCh38/),
we can create the mask file using the command:

```
> cnvpytor -root MGSCv37_mask_file.pytor -mask ~/hg19/mouse.strict_mask.whole_genome.fasta.gz -make_mask_file
```

If we do not have a mask file, we can skip this step. The mask file marks
regions of the genome that are more accessible to next-generation sequencing methods
using short reads. CNVpytor uses P-marked positions to filter SNPs and the read depth signal.
If the reference genome configuration does not contain a mask file, CNVpytor will still be fully functional,
apart from the filtering step.
You may also generate your own mask file by creating a fasta file that contains the character "P" where the corresponding
base pair passes the filter and any other character where it does not.
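
The "P" convention described above can be sketched in a few lines of Python. This is a hedged illustration, not CNVpytor code: the helper names `build_mask` and `passing_fraction_per_bin` are hypothetical, and using "N" for non-passing positions is an assumption (the text only requires a character other than "P").

```python
def build_mask(length, passing_intervals, fail_char="N"):
    """Return a mask string with "P" inside each half-open [start, end) interval.

    `fail_char` marks positions that do not pass; "N" is an arbitrary choice.
    """
    if fail_char == "P":
        raise ValueError('fail_char must differ from "P"')
    mask = [fail_char] * length
    for start, end in passing_intervals:
        for pos in range(max(0, start), min(length, end)):
            mask[pos] = "P"
    return "".join(mask)


def passing_fraction_per_bin(mask, bin_size=100):
    """Return the fraction of "P" positions in each consecutive bin."""
    fractions = []
    for start in range(0, len(mask), bin_size):
        window = mask[start:start + bin_size]
        fractions.append(window.count("P") / len(window))
    return fractions


if __name__ == "__main__":
    toy_mask = build_mask(10, [(2, 5)])
    print(toy_mask)
    print(passing_fraction_per_bin(toy_mask, bin_size=5))
```

The resulting string would still need to be written out as a fasta record per chromosome before it could be passed to `-mask`.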

Now we will create an example_ref_genome_conf.py file containing the following:

```
import_reference_genomes = {
    "mm9": {
        "name": "MGSCv37",
        "species": "Mus musculus",
        "chromosomes": OrderedDict(
            [("chr1", (197195432, "A")), ("chr2", (181748087, "A")), ("chr3", (159599783, "A")),
             ("chr4", (155630120, "A")), ("chr5", (152537259, "A")), ("chr6", (149517037, "A")),
             ("chr7", (152524553, "A")), ("chr8", (131738871, "A")), ("chr9", (124076172, "A")),
             ("chr10", (129993255, "A")), ("chr11", (121843856, "A")), ("chr12", (121257530, "A")),
             ("chr13", (120284312, "A")), ("chr14", (125194864, "A")), ("chr15", (103494974, "A")),
             ("chr16", (98319150, "A")), ("chr17", (95272651, "A")), ("chr18", (90772031, "A")),
             ("chr19", (61342430, "A")), ("chrX", (166650296, "S")), ("chrY", (15902555, "S")),
             ("chrM", (16299, "M"))]),
        "gc_file": "/..PATH../MGSCv37_gc_file.pytor",
        "mask_file": "/..PATH../MGSCv37_mask_file.pytor"
    }
}
```

The "mask_file" entry can be omitted if there is no mask file.
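
Note that the configuration file uses `OrderedDict` without importing it, so whatever reads the file has to supply that name. The sketch below shows one way such a file could be read: executing it in a dedicated namespace that already contains `OrderedDict`. The function `read_genome_conf` is hypothetical and is not CNVpytor's loader (this commit's `load_reference_genomes` executes the file against the module globals instead). Because `exec` runs arbitrary code, only trusted configuration files should be loaded this way.

```python
from collections import OrderedDict


def read_genome_conf(filename):
    """Execute a configuration file and return its import_reference_genomes dict.

    Hypothetical helper: the file runs in its own namespace, which is
    pre-populated with OrderedDict so the file does not need an import.
    """
    namespace = {"OrderedDict": OrderedDict}
    with open(filename) as handle:
        exec(handle.read(), namespace)
    # A file that defines nothing yields an empty mapping.
    return namespace.get("import_reference_genomes", {})


def has_mask(genome):
    """Report whether a genome entry declares the optional "mask_file" key."""
    return "mask_file" in genome
```

With the file from this example, `read_genome_conf("example_ref_genome_conf.py")["mm9"]["name"]` would give `"MGSCv37"`.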

To use CNVpytor with the new reference genome, use the -conf option in each cnvpytor command, e.g.:
```
cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rd file.bam
```

CNVpytor will use chromosome lengths from the alignment file to detect the reference genome.
However, if you configured the reference genome after you had already run the -rd step, you
can assign the reference genome using -rg:
```
cnvpytor -conf REL_PATH/example_ref_genome_conf.py -root file.pytor -rg mm9
```
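
Detection by chromosome length can be pictured as a lookup against the configured genomes. The sketch below is an assumption about how such matching might work, not CNVpytor's actual rule: the function `detect_reference_genome` is hypothetical, and it accepts a genome when every chromosome name shared with the alignment has the same length.

```python
def detect_reference_genome(genomes, observed_lengths):
    """Return the key of the first configured genome matching the alignment.

    `genomes` has the shape of import_reference_genomes; `observed_lengths`
    maps chromosome names from the alignment header to their lengths.
    Matching rule (an assumption of this sketch): at least one chromosome
    name is shared, and all shared chromosomes have equal lengths.
    """
    for key, genome in genomes.items():
        expected = {name: info[0] for name, info in genome["chromosomes"].items()}
        shared = set(expected) & set(observed_lengths)
        if shared and all(expected[c] == observed_lengths[c] for c in shared):
            return key
    return None


if __name__ == "__main__":
    # Two chromosomes taken from the mm9 configuration in this example.
    conf = {"mm9": {"chromosomes": {"chr19": (61342430, "A"), "chrM": (16299, "M")}}}
    print(detect_reference_genome(conf, {"chr19": 61342430, "chrM": 16299}))
```

When no configured genome matches, the function returns `None`, which corresponds to the situation where the genome has to be assigned explicitly with -rg.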

To avoid typing "-conf REL_PATH/example_ref_genome_conf.py" each time you run cnvpytor,
you can create an alias. However, we encourage you to send us the configuration,
GC and mask files, and we will be glad to include them in the CNVpytor code. Or, even better,
fork the repository on GitHub, add the configuration to cnvpytor/genome.py and the data files to cnvpytor/data,
and create a pull request.

