These scripts reduce the number of calls made by the Parliament2 SV caller by:
- Removing SV calls which did not pass QC
- Removing SV calls which did not have >=2 callers agree on the SV
- Removing regions which appear in DGV Gold Standard variants
- Filtered VCFs are used to determine how often SVs are found in regions of interest (transcript_annotated_genes_of_interest.bed)
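The first two filters above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes that the FILTER column holds `PASS` for calls which passed QC and that the `CALLERS` INFO field is a comma-separated list of caller names. The function names are hypothetical.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'SUPP=2;CALLERS=A,B' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        fields[key] = value  # flag-style entries get an empty string
    return fields


def keep_record(line, min_callers=2):
    """Return True if a VCF data line passed QC and has enough callers."""
    cols = line.rstrip("\n").split("\t")
    filter_value, info = cols[6], parse_info(cols[7])
    # Assumption: CALLERS is a comma-separated list of caller names.
    callers = [c for c in info.get("CALLERS", "").split(",") if c]
    return filter_value == "PASS" and len(callers) >= min_callers


def filter_vcf_lines(lines, min_callers=2):
    """Keep header lines unchanged and drop data lines failing either filter."""
    return [l for l in lines if l.startswith("#") or keep_record(l, min_callers)]
```

Header lines (starting with `#`) are passed through so that the filtered output is still a valid VCF.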
Try to discover additional significant SVs in genes which aren't in the genes of interest (gene discovery)
- Take filtered VCFs & remove all regions which occur in regions of interest
- Annotate the genes these SVs appear in
- Create an output which shows all the gene-annotated regions in which SVs occurred, per patient
- Create a summary document of genes/collections of genes which occurred in >= 20 of the samples in the study
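The final summary step (genes seen in >= 20 samples) can be sketched as below. This is an illustration under stated assumptions rather than the repository's code: the input is assumed to be a list of `(sample_id, gene)` pairs, and the function name is hypothetical.

```python
from collections import defaultdict


def genes_in_min_samples(sample_genes, min_samples=20):
    """Return {gene: sorted sample IDs} for genes seen in >= min_samples samples.

    sample_genes is an iterable of (sample_id, gene) pairs. A gene found
    several times in the same sample is counted once for that sample.
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene in sample_genes:
        samples_by_gene[gene].add(sample_id)
    return {
        gene: sorted(samples)
        for gene, samples in samples_by_gene.items()
        if len(samples) >= min_samples
    }
```

Collecting sample IDs in a set means the count reflects the number of distinct samples, not the number of SV records.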
The below information details how the scripts in the repository work
- Take genes_of_interest.bed (multiple transcripts per gene of interest)
- Run biomaRt to add the HGNC ID and HGNC symbol for each transcript. Output: bed_file_with_gene.bed
DGV gold standard variants (taken from http://dgv.tcag.ca/dgv/docs/DGV.GS.hg38.gff3, release date 2016-05-15)
- Takes the thick regions as the coordinates for each variant (these have the highest confidence)
- Keeps only variants seen in > 5.0% of the population, to align with ACMG guidance for the frequency of likely benign variants in the population
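A rough sketch of the two steps above is given below. It rests on assumptions that this README does not confirm: that the GFF3 attributes column carries a `Frequency` value written as a percentage (for example `7.5%`), and that the thick region corresponds to `inner_start`/`inner_end` attributes. These names are parameters so they can be changed to match the real file; the function names are hypothetical.

```python
def parse_attributes(column9):
    """Parse the GFF3 attributes column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def common_variant_bed(gff_lines, min_freq=5.0, freq_key="Frequency",
                       start_key="inner_start", end_key="inner_end"):
    """Yield (chrom, start, end, id) for variants with frequency > min_freq.

    The attribute names are assumptions about the DGV file; adjust them
    to match the real GFF3.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        try:
            freq = float(attrs[freq_key].rstrip("%"))
            start, end = int(attrs[start_key]), int(attrs[end_key])
        except (KeyError, ValueError):
            continue  # skip records missing the assumed attributes
        if freq > min_freq:
            # BED is 0-based half-open, GFF3 is 1-based inclusive.
            yield cols[0], start - 1, end, attrs.get("ID", ".")
```

Records that lack the assumed attributes are skipped rather than raising an error, so a mismatch in attribute names shows up as an empty output.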
This script performs three separate tasks
- Use bedtools merge to merge regions which are < 50 bp distance from each other
- Use bedtools merge to merge regions which overlap, giving the largest transcript per gene of interest
- Use bedtools intersect to add the gene symbols on to the largest transcript for each gene
- Remove SVs which haven't passed quality filtering
- Remove SVs not called by at least two callers
- Find results from filtered VCF which overlap with regions in the DGV Gold file. 50% overlap of DGV gold & variant from VCF must be met
- Find results from filtered VCF which do not overlap with regions in the DGV Gold file. 50% overlap of DGV gold & variant from VCF must be met
- Filter regions based on genes of interest bed file (transcript_annotated_genes_of_interest.bed)
- Count all the rows, per sample across the different file outputs
- Save as a csv file
- Pull in all the different data into dataframes
- Set each to output on a separate tab of a single .xlsx sheet per sample
- Save!
- Takes a file of all samples intersected with transcript_annotated_genes_of_interest.bed
- Count the number of samples with SVs in each region
- Create a dataframe of samples with SVs on a per gene basis
- Pull in all the different genes of interest data into dataframes
- Set each to output on a separate tab of a single .xlsx sheet
- Save!
- Intersect regions which aren't in DGV gold & find regions which don't appear in genes of interest (transcript_annotated_genes_of_interest.bed)
- Prepares data to be merged, splitting on SV type (deletion or inversion) and genotype
There are a lot of repeated SV regions in the output file, where one SV is repeated many times, each time with a unique ID
- Sort the files saved in the previous step
- Merge regions which overlap by at least one base pair
- Stick the different genotypes per SV type back together and sort again
- Search Ensembl using the genomic coordinates for each merged SV, to find the gene/s associated with that region
- Removes the regions which aren't associated with a gene
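The merge step in the list above (regions which overlap by at least one base pair) can be illustrated with a small pure-Python sketch. The scripts may use a different tool for this step; the function name is hypothetical, and coordinates are assumed to be BED-style half-open intervals.

```python
def merge_overlapping(regions):
    """Merge (chrom, start, end) regions that share at least one base.

    Coordinates are treated as BED-style half-open intervals, so two
    regions overlap when the next start is strictly less than the current end.
    """
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

To mirror the workflow described above, this would be applied separately to each SV type and genotype group, so that only like-for-like regions are merged.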
The step above (annotation with Ensembl) removes all the information given by Parliament2; this section puts the genes from the previous step back on to that data
- Use bedtools intersect to see which regions from the merged file (with the unique IDs) intersect with the genes found in the previous step
- Merge the deletions and inversions back together
- Remove unnecessary columns
- Puts all the genes for each merged region into one row per SV region, with the associated genes
- Puts all the genes (and combinations of genes for larger SVs) into one dataframe, with the patient IDs
- Counts regions which overlap with each other
- Splits dataframes by count of samples the SV occurred in and saves
- Pull in the output from the last script
- Save as tabs of the sample .xlsx sheet
QUAL: - LowQual, Description="Variant calls with this profile of supporting calls typically have a low overall precision" - Unknown, Description="Insufficient quality evidence exists for calls of this type and support" - Unconfirmed, Description="It was not possible to confirm this event by genotyping"
CIEND: - Description="PE confidence interval around END"
CIPOS: - Description="PE confidence interval around POS"
CHR2: - Description="Chromosome for END coordinate in case of a translocation"
END: - Description="End position of the structural variant"
AVGLEN: - Description="Length of the SV"
SVMETHOD: - Description="Method for generating this merged VCF file."
SVTYPE: - Description="Type of the SV."
SUPP_VEC: - Description="Vector of supporting samples."
SUPP: - Description="Number of samples supporting the variant"
STRANDS: - Description="Indicating the direction of the reads with respect to the type and breakpoint."
CALLERS: - Description="Callers that support an ALT call at this position. To be included, the caller must have been confirmed by separate genotyping with SVTyper"
GT: - Description="Genotype"
- Rows per different file type (denoting the filtering steps)
Unfiltered VCF: This shows all the variants in the VCF before filtering
Filtered VCF: Variants remaining after filtering. QUAL must be PASS, >= two callers supporting SV.
No DGV overlap regions: This is the filtered VCF from the above step, but the remaining variants do not have >=50% overlap with a known DGV Gold Standard SV.
BED file overlap regions: Filtered VCF, but only showing regions which overlap with the transcript_annotated_genes_of_interest.bed file
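The >=50% overlap criterion behind the "No DGV overlap regions" row can be sketched as below. Whether the threshold is applied reciprocally (to both the SV and the DGV region) or only to one of them is not stated here, so the sketch exposes this as a `reciprocal` parameter; treating it as reciprocal by default is an assumption, and the function names are hypothetical.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Fraction of interval A (half-open) that is covered by interval B."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    length = a_end - a_start
    return overlap / length if length > 0 else 0.0


def meets_overlap(sv, dgv, min_fraction=0.5, reciprocal=True):
    """True if the SV and DGV region (start, end) overlap by >= min_fraction.

    With reciprocal=True the threshold must hold for both intervals;
    otherwise it only has to hold for the SV.
    """
    sv_ok = overlap_fraction(*sv, *dgv) >= min_fraction
    if not reciprocal:
        return sv_ok
    return sv_ok and overlap_fraction(*dgv, *sv) >= min_fraction
```

A small SV that lies entirely inside a much larger DGV region passes the one-sided test but fails the reciprocal one, which is why the choice matters for how many variants are removed.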
- Count of SV per gene: counts the number of SVs detected per SV type, per gene
- Gene tabs: Show the genomic coordinates of the SVs which occurred in each gene, and other sample information
- Genomic regions, with overlapping SVs (with the same genotype) merged and the gene/s on a per patient basis
- genes, count (in the sample population), SV type and IDs of the samples with this variant
- Each tab represents SVs which occurred in a different number of samples in the study