CNVExpo

CNVExpo: Python package for detection of copy number variation in next-generation sequencing data

CNVExpo takes mapped and indexed BAM files as input. The user provides a target BED file defining the chromosomal positions to be analysed. bcftools mpileup is used to determine the read depth per position. Samples can optionally be clustered based on an arbitrarily chosen control gene.

The structure of files in CNVExpo is shown below.

├── cnvcal.py
├── data
│   ├── ccds.gtf
│   ├── control.bed
│   ├── GRCh38_full_analysis_set_plus_decoy_hla.fa (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa)
│   └── GRCh38_full_analysis_set_plus_decoy_hla.fa.fai
├── scripts
│   ├── cnvest.py
│   ├── cnvplot.py
│   ├── cnvtab.py
│   ├── cnvvis.py
│   ├── findclust.py
│   └── igv.py
└── workdir
    ├── input
    │   ├── bed file
    │   ├── gene list
    │   └── sample list
    └── output
        ├── vcf
        └── report

Installation

CNVExpo requires the following software and has been tested on Linux with the versions listed below.
Requirements:
Python==3.7.13
pysam==0.19.0
pandas==1.3.5
bcftools==1.9
cyvcf2==0.30.15
click==7.1.2
numpy==1.21.5
scipy==1.7.3
sklearn==0.0
vcf-parser==1.6
pybedtools==0.8.1

The following files must be placed in the data folder: GRCh38_full_analysis_set_plus_decoy_hla.fa and GRCh38_full_analysis_set_plus_decoy_hla.fa.fai

To install, clone the repository:

git clone https://github.com/sysbiocoder/CNVExpo.git

Run all commands from the CNVExpo folder.

Step 1: Input requirements

Create a project folder (project_folder), which serves as the working directory.
Inside the project folder, create a directory named input and store the BED files, sample list and gene list in it.
CNVExpo also requires the location of the BAM files, named as ".bam".
• Generating the genes/control BED file:
The BED file must have 6 space-delimited fields for each gene or region.
The required fields are:
Chr start end INFO/Genename strand biotype

Example bed file (control.bed):
chr1 11107485 11107500 MTOR - protein_coding
chr1 11108180 11108286 MTOR - protein_coding
chr1 11109289 11109370 MTOR - protein_coding
chr1 11109648 11109729 MTOR - protein_coding
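
Before running the pipeline it can help to check that a BED file really has the six fields shown above. The following is a minimal sketch of such a check, not part of CNVExpo; the names `BedRecord`, `parse_bed_line` and `parse_bed` are hypothetical, and the rule that start must be smaller than end is an assumption based on the usual BED convention.

```python
from typing import List, NamedTuple


class BedRecord(NamedTuple):
    """One line of the six-field BED layout described above (hypothetical type)."""
    chrom: str
    start: int
    end: int
    gene: str
    strand: str
    biotype: str


def parse_bed_line(line: str) -> BedRecord:
    """Split one whitespace-delimited line and validate its fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    chrom, start, end, gene, strand, biotype = fields
    start_pos, end_pos = int(start), int(end)
    # Assumption: intervals follow the usual BED convention of start < end.
    if start_pos >= end_pos:
        raise ValueError(f"start must be smaller than end: {line!r}")
    if strand not in {"+", "-"}:
        raise ValueError(f"strand must be '+' or '-': {line!r}")
    return BedRecord(chrom, start_pos, end_pos, gene, strand, biotype)


def parse_bed(text: str) -> List[BedRecord]:
    """Parse every non-empty line of a BED file's contents."""
    return [parse_bed_line(line) for line in text.splitlines() if line.strip()]


example = """chr1 11107485 11107500 MTOR - protein_coding
chr1 11108180 11108286 MTOR - protein_coding
"""
records = parse_bed(example)
for record in records:
    print(record.gene, record.chrom, record.end - record.start)
```

A malformed line raises `ValueError` with the offending text, which makes formatting problems visible before any BAM file is read.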

• Generating the sample list:
Create a sample list text file with one sample per line.

Example sample list (sample.txt)
22101
22102
22105

• Generating the gene list:
Create a gene list text file with one gene per line.

Example gene list (gene.txt)
PCSK9
LDLR
APOB
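
The input layout from Step 1 can also be prepared with a few lines of Python. This is a minimal sketch under stated assumptions: the helper name `init_project` is hypothetical, the file names sample.txt and gene.txt simply follow the examples above, and the demo writes into a temporary folder rather than a real project.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def init_project(project_dir: Path, samples: Iterable[str],
                 genes: Iterable[str]) -> Path:
    """Create <project_dir>/input and write the sample and gene lists."""
    input_dir = project_dir / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    # One entry per line, as required for both list files.
    (input_dir / "sample.txt").write_text("\n".join(samples) + "\n")
    (input_dir / "gene.txt").write_text("\n".join(genes) + "\n")
    return input_dir


# Demo in a throw-away directory, using the example values from above.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = init_project(
        Path(tmp) / "project_folder",
        samples=["22101", "22102", "22105"],
        genes=["PCSK9", "LDLR", "APOB"],
    )
    sample_lines = (input_dir / "sample.txt").read_text().splitlines()
    gene_lines = (input_dir / "gene.txt").read_text().splitlines()

print(sample_lines)
print(gene_lines)
```

The BED files still have to be copied into the same input directory by hand.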

Step 2: Clustering analysis

To determine the background dataset, use the findclust.py script.
It requires the control BED file, the BAM folder (bamfolder, the location of the BAM files), the sample list and the working directory.
Run the Python script as shown below:

python scripts/findclust.py --infile sample.txt --workdir project_folder --bedfile test.bed --bamfolder bamfiles_locn --threads number


It generates a cluster folder inside the input directory, containing one list per cluster with one sample per line. These lists can be used as input for the CNV estimation step.
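
When several cluster lists are produced, each one can be passed to the CNV estimation step in turn. The following is a minimal sketch that only builds the command lines and does not run them. The folder name `cluster`, the helper name `cnvcal_commands`, the made-up list names in the demo, and passing each list by file name are all assumptions rather than confirmed CNVExpo behaviour; only the command-line options themselves are taken from this README.

```python
import tempfile
from pathlib import Path
from typing import List


def cnvcal_commands(project_dir: Path, bedfile: str, bamfolder: str,
                    threads: int) -> List[List[str]]:
    """Build one cnvcal.py command per cluster list (hypothetical helper)."""
    # Assumption: the cluster lists live in <project_dir>/input/cluster.
    cluster_dir = project_dir / "input" / "cluster"
    commands = []
    for cluster_file in sorted(cluster_dir.iterdir()):
        if not cluster_file.is_file():
            continue
        commands.append([
            "python", "cnvcal.py",
            "--infile", cluster_file.name,
            "--bedfile", bedfile,
            "--bamfolder", bamfolder,
            "--workdir", str(project_dir),
            "--threads", str(threads),
        ])
    return commands


# Demo on a throw-away folder with two made-up cluster lists.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project_folder"
    demo_cluster_dir = project / "input" / "cluster"
    demo_cluster_dir.mkdir(parents=True)
    (demo_cluster_dir / "cluster_1_samples").write_text("22101\n22102\n")
    (demo_cluster_dir / "cluster_2_samples").write_text("22105\n")
    commands = cnvcal_commands(project, "target.bed", "bamfiles_locn", 4)

for command in commands:
    print(" ".join(command))
```

Each inner list could be handed to `subprocess.run` once the cluster list has been copied to the input directory.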


Step 3: CNV Estimation

To estimate CNVs, use the cnvcal.py script.
It requires the BED file of the genes/regions, the BAM folder (bamfolder, the location of the BAM files), the sample list (or a cluster file from the previous step, copied to the input directory) and the working directory.
Run the Python script as shown below:

python cnvcal.py --infile clusterfile --bedfile target.bed --bamfolder bamfiles_locn --workdir project_folder --threads number

It generates VCF files, an HTML report and depth_cal.txt for each sample.
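
For orientation, read-depth CNV methods generally compare a sample's normalised depth with that of a background set. The sketch below illustrates that general idea only; it is not taken from cnvcal.py, and the function names, the median normalisation, the diploid baseline of 2 copies and the toy depth values are all assumptions.

```python
from statistics import median
from typing import List, Sequence


def normalise(depths: Sequence[float]) -> List[float]:
    """Scale per-position depths by the sample's own median depth."""
    centre = median(depths)
    if centre == 0:
        raise ValueError("median depth is zero; cannot normalise")
    return [depth / centre for depth in depths]


def copy_ratios(sample: Sequence[float],
                background: Sequence[Sequence[float]]) -> List[float]:
    """Per-position ratio of the sample to the median of the background."""
    norm_sample = normalise(sample)
    norm_background = [normalise(depths) for depths in background]
    ratios = []
    for index, value in enumerate(norm_sample):
        reference = median(depths[index] for depths in norm_background)
        ratios.append(value / reference if reference > 0 else float("nan"))
    return ratios


# Toy data: the sample has half the expected depth at positions 2 and 3.
sample_depths = [100, 100, 50, 50, 100, 100]
background_depths = [[80] * 6, [120] * 6]

ratios = copy_ratios(sample_depths, background_depths)
# Assumption: a diploid baseline, so a ratio of 1.0 corresponds to 2 copies.
estimated_copies = [2 * ratio for ratio in ratios]
print(ratios)
print(estimated_copies)
```

In this toy case the halved depth translates into an estimate of one copy, the pattern expected for a heterozygous deletion.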


Step 4: CNV Visualization

To visualize and explore CNVs, use cnvvis.py.
It requires a gene list, a sample list (maximum 10 samples), the BAM folder (bamfolder, the location of the BAM files) and the working directory.
Run the Python script as shown below:

python cnvvis.py --genelist genes_list.txt --samplelist cluster_1_samples --bamfolder bamfiles_locn --workdir project_folder

It generates a visualization tool for the gene list and sample list provided.
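
Because the visualization step accepts at most 10 samples, it can be worth checking a sample list before passing it to cnvvis.py. This is a minimal sketch of such a check; the helper name `check_sample_list` is hypothetical, and how cnvvis.py itself reacts to longer lists is not specified here.

```python
from typing import List

MAX_SAMPLES = 10  # limit stated above for the visualization step


def check_sample_list(lines: List[str]) -> List[str]:
    """Strip blank lines and enforce the 10-sample limit (hypothetical helper)."""
    samples = [line.strip() for line in lines if line.strip()]
    if not samples:
        raise ValueError("sample list is empty")
    if len(samples) > MAX_SAMPLES:
        raise ValueError(
            f"{len(samples)} samples given, but at most {MAX_SAMPLES} are allowed"
        )
    return samples


# A short list passes; lines are given as they would be read from a file.
accepted = check_sample_list(["22101\n", "22102\n", "\n", "22105\n"])
print(accepted)

# A list with 11 entries is rejected.
try:
    check_sample_list([str(number) for number in range(11)])
    too_many_rejected = False
except ValueError as error:
    too_many_rejected = True
    print(error)
```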