CNVExpo: Python package for detection of copy number variation in next generation sequencing data
CNVExpo takes mapped and indexed BAM files as input. The user provides a target BED file defining the chromosomal positions to be analysed. bcftools mpileup is used to determine the read depth per position. Samples can optionally be clustered based on an arbitrarily chosen control gene.
The structure of files in CNVExpo is shown below.
├── cnvcal.py
├── data
│ ├── ccds.gtf
│ ├── control.bed
│ ├── GRCh38_full_analysis_set_plus_decoy_hla.fa (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa)
│ └── GRCh38_full_analysis_set_plus_decoy_hla.fa.fai
├── scripts
│ ├── cnvest.py
│ ├── cnvplot.py
│ ├── cnvtab.py
│ ├── cnvvis.py
│ ├── findclust.py
│ └── igv.py
└── workdir
  ├── input
  │ ├── bed file
  │ ├── gene list
  │ └── sample list
  └── output
    ├── vcf
    └── report
CNVExpo requires the following libraries and has been tested in a Linux environment with the versions listed below.
Requirements:
Python==3.7.13
pysam==0.19.0
pandas==1.3.5
bcftools==1.9
cyvcf2==0.30.15
click==7.1.2
numpy==1.21.5
scipy==1.7.3
sklearn==0.0
vcf-parser==1.6
pybedtools==0.8.1
The following files must be placed in the data folder: GRCh38_full_analysis_set_plus_decoy_hla.fa and GRCh38_full_analysis_set_plus_decoy_hla.fa.fai
To install, clone the repository:
git clone https://github.com/sysbiocoder/CNVExpo.git
Run all commands from the CNVExpo folder.
Create a project_folder to serve as the working directory.
Inside the project folder, create a directory named input and store the BED files, sample list and gene list in it.
CNVExpo also requires the location of the BAM files, which must be named with the ".bam" extension.
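The setup steps above can be sketched in Python. This is a minimal illustration, not part of CNVExpo: the helper names make_project and missing_bams are hypothetical, and the check assumes each BAM file is named after its sample (for example 22101.bam), which this README does not state explicitly.

```python
from pathlib import Path


def make_project(project_folder):
    """Create project_folder/input and return the input directory path."""
    input_dir = Path(project_folder) / "input"
    # parents=True also creates project_folder; exist_ok=True makes reruns safe.
    input_dir.mkdir(parents=True, exist_ok=True)
    return input_dir


def missing_bams(bamfolder, samples):
    """Return the samples that have no '<sample>.bam' file in bamfolder.

    ASSUMPTION: BAM files are named after the sample identifier.
    """
    return [s for s in samples if not (Path(bamfolder) / f"{s}.bam").is_file()]
```

Calling missing_bams before a long run is a cheap way to catch a mistyped sample name or a wrong BAM folder early.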
• Generating genes/control bedfile:
The BED file must have 6 space-delimited fields for each gene or region.
The required fields are:
Chr start end INFO/Genename strand biotype
Example bed file (control.bed):
chr1 11107485 11107500 MTOR - protein_coding
chr1 11108180 11108286 MTOR - protein_coding
chr1 11109289 11109370 MTOR - protein_coding
chr1 11109648 11109729 MTOR - protein_coding
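A BED file can be checked against this layout before it is passed to CNVExpo. The sketch below is illustrative only: BedRecord, parse_bed_line and read_bed are hypothetical names, and the start < end check is an extra sanity test rather than a documented CNVExpo requirement. It expects the six columns that appear in the example above.

```python
from collections import namedtuple

# Field names follow the column order: Chr start end Genename strand biotype.
BedRecord = namedtuple("BedRecord", "chrom start end gene strand biotype")


def parse_bed_line(line):
    """Split one whitespace-delimited line into the six expected fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    chrom, start, end, gene, strand, biotype = fields
    start, end = int(start), int(end)
    if start >= end:
        # Assumed sanity check, not a rule stated by CNVExpo.
        raise ValueError(f"start must be smaller than end: {line!r}")
    return BedRecord(chrom, start, end, gene, strand, biotype)


def read_bed(path):
    """Parse every non-blank line of a BED file into BedRecord tuples."""
    with open(path) as handle:
        return [parse_bed_line(line) for line in handle if line.strip()]


record = parse_bed_line("chr1 11107485 11107500 MTOR - protein_coding")
print(record.gene, record.end - record.start)  # prints: MTOR 15
```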
• Generating sample list:
Create a sample list text file with one sample per line.
Example sample list (sample.txt):
22101
22102
22105
• Generating gene list:
Create a gene list text file with one gene per line.
Example gene list (gene.txt):
PCSK9
LDLR
APOB
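Both list files share the same one-entry-per-line layout, so a single reader can load either of them. The following is a small sketch under that assumption; read_list is a hypothetical helper, and rejecting duplicates is a design choice made here, not documented CNVExpo behaviour.

```python
def read_list(path):
    """Return the non-empty, stripped lines of a one-entry-per-line text file."""
    with open(path) as handle:
        entries = [line.strip() for line in handle if line.strip()]
    # Report duplicates instead of silently dropping them.
    duplicates = sorted({e for e in entries if entries.count(e) > 1})
    if duplicates:
        raise ValueError(f"duplicate entries in {path}: {duplicates}")
    return entries
```

For example, read_list("sample.txt") on the sample list shown above would return the three sample identifiers as strings, in file order.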
To determine the background dataset, the findclust.py script can be used.
It requires a control BED file, the bamfolder (the location of the BAM files), a sample list and the working directory.
Run the Python script as follows:
python scripts/findclust.py --infile sample.txt --workdir project_folder --bedfile test.bed --bamfolder bamfiles_locn --threads number
It creates a cluster folder inside the input directory containing one list per cluster, with one sample per line. These cluster lists can be used as input for the CNV estimation step.
To estimate CNV, the cnvcal.py script is used.
It requires a BED file of the genes or regions, the bamfolder (the location of the BAM files), a sample list (or a cluster file from the previous step copied to the input directory) and the working directory.
Run the Python script as follows:
python cnvcal.py --infile clusterfile --bedfile target.bed --bamfolder bamfiles_locn --workdir project_folder --threads number
It generates VCF files, an HTML report and depth_cal.txt for each sample.
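When both steps are run repeatedly, the command lines can be assembled from Python. The sketch below only uses the flags shown in the commands above; build_command and run_step are hypothetical helper names, the thread count of 4 is an arbitrary example, and the file names are the same placeholders used in this README.

```python
import subprocess
import sys


def build_command(script, infile, bedfile, bamfolder, workdir, threads):
    """Return the argument list for findclust.py or cnvcal.py (same flags)."""
    return [
        sys.executable, script,
        "--infile", infile,
        "--bedfile", bedfile,
        "--bamfolder", bamfolder,
        "--workdir", workdir,
        "--threads", str(threads),
    ]


def run_step(script, **options):
    """Run one step; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(script, **options), check=True)


# Placeholder values taken from the commands in this README.
command = build_command(
    "scripts/findclust.py",
    infile="sample.txt",
    bedfile="test.bed",
    bamfolder="bamfiles_locn",
    workdir="project_folder",
    threads=4,
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, and check=True stops a pipeline as soon as one step fails.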
To visualize and explore CNV, the cnvvis.py script can be used.
It requires a gene list, a sample list (maximum 10 samples), the bamfolder (the location of the BAM files) and the working directory.
Run the Python script as follows:
python cnvvis.py --genelist genes_list.txt --samplelist cluster_1_samples --bamfolder bamfiles_locn --workdir project_folder
It generates a visualization tool for the gene list and sample list provided.
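Because cnvvis.py accepts at most 10 samples per sample list, a larger cluster has to be split before visualization. One possible way to do this is sketched below; chunk_samples and write_batches are hypothetical helpers, and the "<prefix>_<n>.txt" file naming is an assumption made for illustration.

```python
def chunk_samples(samples, size=10):
    """Split samples into consecutive batches of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def write_batches(samples, prefix, size=10):
    """Write each batch to '<prefix>_<n>.txt' and return the file paths.

    ASSUMPTION: the naming scheme is illustrative, not a CNVExpo convention.
    """
    paths = []
    for number, batch in enumerate(chunk_samples(samples, size), start=1):
        path = f"{prefix}_{number}.txt"
        with open(path, "w") as handle:
            handle.write("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

Each file written this way can then be passed to cnvvis.py through --samplelist in a separate run.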