Skip to content

Commit

Permalink
vep
Browse files Browse the repository at this point in the history
  • Loading branch information
zqfang committed Aug 19, 2024
1 parent 19a0ccb commit e5b9b47
Show file tree
Hide file tree
Showing 2 changed files with 32 additions and 26 deletions.
19 changes: 5 additions & 14 deletions haplomap/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,31 +18,21 @@ See [variant calling](../workflows/README.md) in the workflow folder using GATK,

Note: You need to use Ensemble-vep to annotate your VCF files.

the required flags: `--variant_class`, `--symbol`, `--individual_zyg all`, `--tab`
the required flags: `--variant_class`, `--symbol`, `--individual_zyg all`, `--tab`, `--fasta`

This code snap works for haplomap
```shell
## ensembl-vep version 101
bcftools view -f .,PASS ${vcf} | \
vep --fasta ${reference} ${genome_build} \
--format vcf --fork ${threads} --hgvs --force_overwrite \
--uniprot --domains --symbol --regulatory --distance 1000 --biotype \
--gene_phenotype MGI --check_existing --pubmed --numbers \
--offline --cache --variant_class \
--gencode_basic --no_intergenic --individual all \
-o ${output} --tab --compress_output gzip \

## for SV: https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#sv
## For SV input format: https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#sv
## ensembl-vep version 111
bcftools view -f PASS ${vcf} | \
bcftools view -f .,PASS ${vcf} | \
vep -a GRCm38 --species mus_musculus --refseq --cache --cache_version 102 \
--offline --compress_output gzip -o ${output} --tab \
--fork 16 --offline --uniprot --cache --format vcf \
--force_overwrite -overlaps --plugin TSSDistance --domains \
--plugin phenotypes --symbol --canonical --variant_class \
--nearest gene --regulatory --distance 5000 \
--individual_zyg all --no_check_variants_order \
--max_sv_size 100000
--max_sv_size 100000 --fasta GRCm38.fa
```


Expand Down Expand Up @@ -122,6 +112,7 @@ python scripts/annotateSNPs.py test_strains.txt chr18.txt \
build/bin/haplomap eblocks -a ${HOME}/data/SNPS/chr18.txt \
-g ${HOME}/data/chr18.annotation.txt \
-s ${HOME}/TMPDATA/test_strains.txt \
-p ${HOME}/TMPDATA/test.SNPs.vb.txt \
-o ${HOME}/TMPDATA/test.SNPs.hb.txt
```

Expand Down
39 changes: 27 additions & 12 deletions haplomap/src/vep.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -135,10 +135,24 @@ void VarirantEeffectPredictor::readHeader(char *inFileName, char *delemiter)
{
this->hasIND = false;
}
if (columns.find("VARIANT_CLASS") == columns.end())
// required column from VEP
std::vector<std::string> req = {"Uploaded_variation", "Location", "Allele", "Feature",
"Consequence","Amino_acids", "Codons", "ZYG",
"IMPACT", "VARIANT_CLASS", "SYMBOL"};
for (auto & r: req)
{
std::cerr<<"VARIANT_CLASS column is missing, please re-run vep with `--variant_class`\n";
std::exit(1);
if (columns.find(r) == columns.end())
{
std::cerr<<"column: "<<r<<" is missing, please re-run vep\n";
std::cerr<<"An example vep command: \n\n"
"bcftools view -f .,PASS ${vcf} | "
"vep -a GRCm38 --species mus_musculus --refseq --cache --cache_version 102 "
"--offline --compress_output gzip -o ${output} --tab --fasta GRCm38.fa "
"--fork 16 --offline --cache --format vcf --individual_zyg all "
"--force_overwrite -overlaps --regulatory "
"--symbol --canonical --variant_class "<<std::endl;
std::exit(1);
}
}

}
Expand All @@ -157,20 +171,21 @@ std::string VarirantEeffectPredictor::set_key(std::string location, std::string
std::string key;
std::string chrom = location.substr(tok0, tok1);

/// For ensembl-vep results, the coordinates (chrStart) of
/// indel, deletion need to -1 to get original position in vcf.
/// SNV, insertion stay the same to position in vcf
/// For ensembl-vep cooridnates, the coordinates (chrStart) of indel, deletion
/// need to -1 to match to original coordinates in vcf, since VEP trim 1 preceding base.
/// SNV, insertion stay the same to coordinates in vcf
/// ref here for details: https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html

/// The quick trick to handle these cases is whether the location string contains "-"
if (tok2 == std::string::npos) // no match found
{
start = std::stoi(location.substr(tok1 + 1, sz-tok1)); // (pos, len)
if (tok2 == std::string::npos) // no match found, it's a snv
{ // (pos, len)
start = std::stoi(location.substr(tok1 + 1, sz-tok1));
end = start;
if (var_class == "indel") start --; // indel (dup) with "chr:start" format need to minus 1
}
else
{
// NOTE: need to minius 1, since VEP made pos+1 in their annotatoin
start = std::stoi(location.substr(tok1 + 1, tok2 - tok1)) - 1;
{ // '-' is found, means it's a indel or SV
start = std::stoi(location.substr(tok1 + 1, tok2 - tok1)) - 1; // match to original cooridnates in vcf
if (var_class == "insertion") start ++; // only insertion case are stay the same pos as original vcf
end = std::stoi(location.substr(tok2 + 1, sz - tok2)); // empty string if snp
}
Expand Down

0 comments on commit e5b9b47

Please sign in to comment.