ASCAT is a software for performing allele-specific copy number analysis of tumor samples and for estimating tumor ploidy and purity (normal contamination). ASCAT is written in R and available here: github.com/Crick-CancerGenomics/ascat.
To run ASCAT on NGS data we need BAM files for the tumor and normal samples, as well as a loci file with SNP positions. If ASCAT is run on SNP array data, the loci file contains the SNPs on the chip. When runnig ASCAT on NGS data we can use the same loci file, for exampe the one corresponding to the AffymetrixGenome-Wide Human SNP Array 6.0, but we can also choose a loci file of our choice with i.e. SNPs detected in the 1000 Genomes project.
Running ASCAT on NGS data requires that the BAM files are converted into BAF and LogR values. This can be done using the software AlleleCount followed by a simple R script. AlleleCount extracts the number of reads in a BAM file supporting each allele at specified SNP positions. Based on this, BAF and logR can be calculated for every SNP position i as:
BAFi(tumor)=countsBi(tumor)/(countsAi(tumor)+countsBi(tumor))
BAFi(normal)=countsBi(normal)/(countsAi(normal)+countsBi(normal))
LogRi(tumor)=log2((countsAi(tumor)+countsBi(tumor))/(countsAi(normal)+countsBi(normal)) - median(log2((countsA(tumor)+countsB(tumor))/(countsA(normal)+countsB(normal)))
LogRi(normal)=0
For male samples, the X chromosome markers have special treatment:
LogRi(tumor)=log2((countsAi(tumor)+countsBi(tumor))/(countsAi(normal)+countsBi(normal))-1 - median(log2((countsA(tumor)+countsB(tumor))/(countsA(normal)+countsB(normal))-1)
where: i corresponds to the postions of all SNPs in the loci file. CountsA and CountsB are vectors containing number of reads supporting the A and B alleles of all SNPs A = the major allele B = the minor allele Minor and major alleles are defined in the loci file (it actually doesn't matter which one is defied as A and B in this application)
Calculation of LogR and BAF based on AlleleCount output is done as in runASCAT.R in the ascatNgs repository on Github.
The loci file was created based on the 1000Genomes latest release (phase 3, releasedate 20130502), available here. The following filter was applied: Only bi-allelc SNPs with minor allele frequencies > 0.3.
The loci file was originally generated for GRCh37. It was translated into GRCh38 using the tool liftOver available at the UCSC Genome Browser. To run liftOver the loci file was first written in bed format:
awk '{print "chr"$1":"$2"-"$2}' 1000G_phase3_20130502_SNP_maf0.3.loci > 1000G_phase3_20130502_SNP_maf0.3.bed
Using the web interface to liftOver at genome.ucsc.edu the file was translated into GRCh38 coordinates. LiftOver was possible for 3261270 out of 3268043 SNPs. The converted SNP positions were printed in the format required by AlleleCounter by:
more hglft_genome_5834_13aba0.bed | awk 'BEGIN{FS="chr"} {print $2}' | awk 'BEGIN{FS="-"} {print $1}' | awk 'BEGIN{FS=":";OFS="\t"} {print $1,$2}' > 1000G_phase3_GRCh38_maf0.3.loci
Input files for ASCAT's GC correction were created for the above loci files, using the scripts and instructions for this on ASCAT's github repository
The following scripts were downloaded from https://github.com/Crick-CancerGenomics/ascat/tree/master/gcProcessing:
- createGCcontentFile.R
- createWindowBed.pl
- GCfileCreation.sh.
To generate the GC correction file additional files are needed:
- locifile described above
- reference.fasta is the genome reference file in fasta format
The files are descibed in Genomes and reference files documentation
- chromosomesizes.txt is a tab delimited text file containing the size of all chromosomes included in the loci file.
An example file is available in https://github.com/Crick-CancerGenomics/ascat/tree/master/gcProcessing/hg19.chrom.sizes
The genomc reference file we use in Sarek for GRCh37 is coded without "chr" in the chromosome names, while the genome reference file we use in Sarek for GRCh38 includes "chr" in the chromosome names.
The script createWindowBed.pl assumes that chr
is included in the chromosome names of the reference file, so a small modification of this script was done for the process to work on our GRCh37 loci file.
These two lines in createWindowBed.pl generate output (lines 61 and 64):
(61) print OUT "chr".$tab[1]."\t".$start."\t".$stop."\t".$tab[0]."\t".$tab[2]."\t".($w*2+1)."\n";
(64) print OUT "chr".$tab[1]."\t".$start."\t".$stop."\t".$tab[0]."\t".$tab[2]."\t".($w*2)."\n";
and were changed to:
(61) print OUT $tab[1]."\t".$start."\t".$stop."\t".$tab[0]."\t".$tab[2]."\t".($w*2+1)."\n";
(64) print OUT $tab[1]."\t".$start."\t".$stop."\t".$tab[0]."\t".$tab[2]."\t".($w*2)."\n";
After this modification the script works for our GRCh37 loci file.
The following sbatch script was run on the Uppmax cluster Rackham, to generate the CG correction files:
#!/bin/bash -l
#SBATCH -A projid
#SBATCH -p node
#SBATCH -t 24:00:00
#SBATCH -J createGCfile
module load bioinfo-tools
module load BEDTools
module load R/3.5.0
module load R_packages/3.5.0
./GCfileCreation.sh 1000G_phase3_20130502_SNP_maf0.3.loci chrom.sizes 19 human_g1k_v37_decoy.fasta
where:
- 1000G_phase3_20130502_SNP_maf0.3.loci is the loci file for GRCh37 described above
- human_g1k_v37_decoy.fasta is the genome reference file used for GRCh37
- chrom.sizes is the list of the chromosome lengths in GRCh37
Names of the chromosomes in chrom.sizes file must be the same as in the genome reference, so in case of GRCh37 we used "1", "2" etc and in GRCh38 we used "chr1", "chr2" etc.
- 19 means that 19 cores are available for the script.
This created GC correction files with the following column headers:
Chr Position 25 50 100 200 500 1000 2000 5000 10000 20000 50000 100000 200000 500000 1M 2M 5M 10M
This file gave an error when running ASCAT, and the error message suggested that it had to do with the column headers. The Readme.txt in https://github.com/Crick-CancerGenomics/ascat/tree/master/gcProcessing suggested that the column headers should be:
Chr Position 25bp 50bp 100bp 200bp 500bp 1000bp 2000bp 5000bp 10000bp 20000bp 50000bp 100000bp 200000bp 500000bp 1M 2M 5M 10M
The column headers headers of the generated GC correction files were therefore manually edited.
The final files are tab-delimited with the following columns (and some example data):
Chr Position 25bp 50bp 100bp 200bp 500bp 1000bp 2000bp 5000bp 10000bp 20000bp 50000bp 100000bp 200000bp 500000bp 1M 2M 5M 10M
snp1 1 14930 0.541667 0.58 0.61 0.585 0.614 0.62 0.6 0.5888 0.588 0.4277 0.395041 0.380702 0.383259 0.341592 0.339747 0.386343 0.500537 0.511514
snp2 1 15211 0.625 0.64 0.67 0.63 0.61 0.612 0.6135 0.591 0.5922 0.4358 0.39616 0.380411 0.383167 0.34163 0.339771 0.386417 0.500558 0.511511
snp3 1 15820 0.541667 0.56 0.62 0.655 0.65 0.612 0.5885 0.5936 0.5797 0.4511 0.397771 0.379945 0.382999 0.341791 0.339832 0.386554 0.500579 0.511504
The ASCAT process gives several images as output, described in detail in this book chapter. The script also gives out a text file (tumor.cnvs.txt) with information about copy number state for all the segments predicted by ASCAT. The output is a tab delimited text file with the following columns:
chr startpos endpos nMajor nMinor
Where:
- chr is the chromosome number
- startpos is the start position of the segment
- endpos is the end position of the segment
- nMajor is number of copies of one of the allels (for example the chromosome inherited from the father)
- nMinor is the number of copies of the other allele (for example the chromosome inherited of the mother)
The file tumor.cnvs.txt contains all segments predicted by ASCAT, both those with normal copy number (nMinor = 1 and nMajor =1) and those corresponding to copy number aberrations.