Skip to content

Latest commit

 

History

History
159 lines (128 loc) · 6.01 KB

README.mason_variator.md

File metadata and controls

159 lines (128 loc) · 6.01 KB

Mason Variator

Overview

The program mason_variator is a stand-alone program for the simulation of SNPs, small indels, and structural variants into genomes. SNPs and small indels are simulated on a per-position probability. Structural variants can be simulated based on such rates as well, but also given as pairs of types and lengths from a TSV file.

The simulation result can be written out in a VCF file. Optionally, the program can also write out a FASTA file with sequence that includes the variants and the breakpoints of the variants in the simulated sequence.

Example

There are example files in the examples directory.

Help

The command:

  $ mason_variator --help

prints the help for Mason Variator

Simple SV Simulation

As an introduction, let us simulate some SVs given by their type and length from a TSV file (sv_sizes.tsv) into a small genome (adeno_virus.fa). We will write the variants into the VCF file adeno_out.vcf, the simulated sequence into the FASTA file adeno_out.fa, and the positions of the breakpoints in adeno_out.fa into the TSV file adeno_out.tsv. For more details on the file formats, see the last Section.

Note that for the human genome, you will want to use the option -n / --num-haplotypes to select the number of haplotypes/allelles to simulate.

  $ mason_variator -ir adeno_virus.fa -ov adeno_out.vcf -of adeno_out.fa \
      -it sv_sizes.tsv --out-breakpoints adeno_out.tsv
  ...

The output to the console should show you that 3 SNPs and 6 structural variants have been simulated. Now, have a look at the generated files.

We can now use the script breakpoint_contigs.awk (in folder "bin") to cut out the regions around the breakpoints in adeno_out.fa and write them into the file adeno_breakpoints.fa. Note that this requires you to have samtools installed.

We cut out the sequence around the breakpoints within a radius of 200bp towards each side:

  $ awk -f breakpoint_contigs.awk -v context=200 -v ref=adeno_out.fa \
      adeno_out.tsv > breakpoint_contigs.fa
  ...

These contigs around breakpoints could now for example be aligned to the original genome using STELLAR and the result then analyzed using GUSTAF:

  $ stellar adeno_virus.fa breakpoint_contigs.fa -o adeno_out.gff
  ...
  $ gustaf -st 1 -m adeno_out.gff adeno_virus.fa breakpoint_contigs.fa
  ...

File Formats

The program mason_variator uses the FASTA file for input and output sequences.

It writes out the variation in the VCF format with almost no special tags or values. See "VCF Format" section for a description of the INFO fields that are not already described in the VCF 4.1 standard.

The last two sections describe the TSV file format for specifying variations to simulate and the TSV format that gives the breakpoints in haplotype coordinates.

VCF Format

mason_variator creates VCF 4.1 files for the simulated variants. The format is described on the website of the 1000 genomes project [1]. mason_variator introduces an INFO key called TARGETPOS that allows compact representations of duplications.

SNPs are simulated such that all simulated haplotypes can contain the alternative base or the original base but at least one haplotype must have the SNP of course.

Small indels and structural variants are simulated such that only one haplotype can have the indel. No SNP or structural variant can overlap or be within one base with an SV breakpoint. SVs cannot overlap each other or be nested. Recombination can only happen within one contig/chromosome. These restrictions make simulation easier and yet are sufficient for the experimental evaluation of SV tools.

SNPs, small indels, and SV indels are given directly in the REF/ALT columns. SV indels are also marked as SVTYPE={INS,DEL}. Duplications are given as in the ALT column, inversions as . Translocations are given as 6 breakend BND entries, 2 for each of the breakpoints.

[1] http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Variation Selection TSV File

The user can give the path to a TSV file with the --in-variant-tsv option. The first two columns are mandatory. The first column gives the variation type to simulate and the second column gives the length of the variant. Comment lines begin with a hash (#). For insertions, you can also give a third column with the sequence to insert. Note that if this sequence is given then the length of the sequence takes precedence over the the length in the second column.

The following variation types are interpreted:

  • INS: An insertion.
  • DEL: A deletion.
  • INV: An inversion.
  • CTR: An intra-chromosomal translocation.
  • DUP: A duplication

An example for the content of such a file is given below:

  #type length
  INS   100
  DEL   1000
  CTR   400
  INS   20     CGATTTAGATATACGTATAC

This file makes the program simulate a 100bp insertion with random characters, a 1000bp deletion, a 400bp intra-chromosomal translocation, and a 20bp insertion with specified sequence. Note that when the --in-variant-tsv parameter is given, structural variants are not simulated on a per-position rate.

Breakpoint TSV File

The VCF file written by mason_variator gives the coordinates in the reference. However, it might also be of interest to easily retrieve the coordinates of breakpoints on the haplotypes with variants.

For this, you can give the parameter --out-breakpoints with a path to a TSV file to write. This file contains three columns, one for the contig/reference name, one for the variant id, and one for the (1-based) position on the contig.

The following shows an example with 4 breakpoints, two on each haplotype, spread over three chromosomes. Note that there are two breakpoints for the insertion on chrI, also two for the inversion on chrII and one breakpoint for the deletion on chrIII. Missing variant names are given with dots (".").

  #ref     id             pos
  chrI/1   sim_sv_indel_1 2051
  chrI/1   sim_sv_indel_1 2351
  chrII/0  sim_inv_1      2451
  chrII/0  sim_inv_2      3573
  chrIII/1 sim_sv_indel_2 9297