Step 3.3: Untargeted mode

Luca Santuari edited this page Mar 26, 2020 · 4 revisions

This approach does the following:

  1. generates multiple models from single-window data, one per truth set used for the training set;
  2. scans each chromosome array (in 200 bp non-overlapping bins) to generate the predictions for SV_start, SV_end and noSV;
  3. combines the predictions and compares the results with the truth set of the test set.

The scripts to run are summarized in this SGE script:

T0_S1_generate_training_data.py The first argument of this script is either:

  • 'positive': generate the positive set, with labels SV_start and SV_end
  • 'negative': generate the negative set, with label noSV

The other arguments are:

  • chrlist: list of chromosomes to generate the positive/negative set for;
  • win: window size to use for the single windows;
  • truthset: SV callset to generate the training data from. Either a truth set or an SV callset from one of the SV callers in the sv-callers workflow;
  • inputdir: the directory containing the chr_array folder with the bcolz chromosome arrays;
  • output: Numpy .npz file to write the output to.
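
To make the positive set more concrete, here is a minimal sketch of how labelled windows could be derived from a truth set. It is not the code of T0_S1_generate_training_data.py: the function name `positive_windows`, the `(chrom, start, end)` tuple layout and the choice to centre each window on the breakpoint are all assumptions made for illustration.

```python
def positive_windows(svs, win):
    """Return one labelled window per SV breakpoint.

    svs: iterable of (chrom, start, end) tuples (hypothetical layout).
    win: window size in bp; each window is centred on a breakpoint
         (centring is an assumption, not taken from the script).
    """
    half = win // 2
    windows = []
    for chrom, start, end in svs:
        # One window around the start breakpoint, one around the end breakpoint.
        windows.append((chrom, start - half, start + half, "SV_start"))
        windows.append((chrom, end - half, end + half, "SV_end"))
    return windows


if __name__ == "__main__":
    for w in positive_windows([("1", 1000, 5000)], win=200):
        print(w)
```

The negative set would be built the same way from positions that do not overlap any SV breakpoint, all labelled noSV.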

T0_S2_train.py Script used to train the model from the training data generated in step T0_S1. Input arguments are:

  • positive: Numpy .npz file of the positive training set (SVs)
  • negative: Numpy .npz file of the negative training set (noSV instances)
  • output: output file for the model in .hdf5 format

T0_S3_scan_chromosome.py Script used to scan a chromosome array and generate the predictions using the model created in step T0_S2. Input arguments are:

  • inputdir: the directory containing the chr_array folder with the bcolz chromosome arrays
  • chr: name of the chromosome to scan;
  • window: window size to split the chromosome array into, the same used in step T0_S1;
  • shift: offset of the first window when tiling the chromosome with non-overlapping windows. 0 means that the windows start at the first chromosome position. It must be in the range [0, window-1];
  • model: model file from step T0_S2 to use for generating the predictions;
  • output: Numpy .npz output file with the predictions:
    • start: start positions predicted for the SV type
    • end: end positions predicted for the SV type
    • probs: posterior probabilities for the predictions generated by the Softmax layer of the model
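
The interplay of window and shift can be sketched as follows. This is an illustrative sketch, not the code of T0_S3_scan_chromosome.py: the helper names `window_starts` and `call_label`, the 0-based coordinates and the label order in `LABELS` are assumptions.

```python
# Assumed label order of the Softmax output; the real model may differ.
LABELS = ("SV_start", "SV_end", "noSV")


def window_starts(chrom_len, window, shift=0):
    """0-based start positions of non-overlapping windows that fit in the chromosome."""
    if not 0 <= shift < window:
        raise ValueError("shift must be in the range [0, window-1]")
    return list(range(shift, chrom_len - window + 1, window))


def call_label(probs):
    """Pick the label with the highest posterior probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best]


if __name__ == "__main__":
    print(window_starts(1000, 200))            # windows tiled from position 0
    print(window_starts(1000, 200, shift=50))  # same tiling, offset by 50 bp
    print(call_label([0.1, 0.7, 0.2]))
```

Running the scan with several shift values gives each breakpoint a chance to fall away from a window edge in at least one of the tilings.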

T0_S4_compare.py This script aggregates the predictions of step T0_S3 across all chromosomes and compares them with the truth set. Input arguments are:

  • truthset: SV truth set to use for the comparison;
  • chrlist: list of chromosomes to consider in the analysis;
  • win: window size used in steps T0_S1 and T0_S3;
  • inputdirlist: list of input directories, one per model, containing the predictions for each chromosome
  • output: CSV file with output stats
  • outputbed: BED file with the final output positions to consider for downstream analysis (targeted assembly)
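
The comparison can be pictured with the sketch below. It is a simplified illustration under stated assumptions, not the logic of T0_S4_compare.py: the function names `compare` and `bed_line`, the matching tolerance of half a window, and the precision/recall statistics are hypothetical choices.

```python
def compare(predicted, truth, win):
    """Return (precision, recall) for predicted breakpoint positions.

    A prediction counts as correct when it lies within win // 2 bp of a
    truth position (the tolerance is an assumption).
    """
    tol = win // 2
    hits = [p for p in predicted if any(abs(p - t) <= tol for t in truth)]
    found = [t for t in truth if any(abs(p - t) <= tol for p in predicted)]
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(found) / len(truth) if truth else 0.0
    return precision, recall


def bed_line(chrom, pos, win):
    """Format a BED record (0-based, half-open) spanning one window around pos."""
    half = win // 2
    return f"{chrom}\t{max(0, pos - half)}\t{pos + half}"


if __name__ == "__main__":
    precision, recall = compare([100, 500, 900], [150, 880, 2000], win=200)
    print(f"precision={precision:.2f} recall={recall:.2f}")
    print(bed_line("chr1", 900, 200))
```

Intervals written this way could then be passed to the targeted assembly step as candidate regions.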