Step 3.3: Untargeted mode

This approach does the following:

generates multiple models, one per truth set for the training set, from single window data;
scans each chromosome array (in 200 bp non-overlapping bins) to generate the predictions for SV_start, SV_end and noSV;
combines the predictions and compares the results with the truth set of the test set.

The scripts to run are summarized in this SGE script:

T0_S1_generate_training_data.py The first argument of this script is either:

'positive': generate the positive set, with labels SV_start and SV_end
'negative': generate the negative set, with label noSV

The other arguments are:

chrlist: list of chromosomes to generate the positive/negative set for;
win: window size to use for the single windows;
truthset: SV call set to generate the training data from, either a truth set or an SV call set from one of the callers in the sv-callers workflow;
inputdir: the directory containing the chr_array folder with the bcolz chromosome arrays;
output: NumPy .npz file to write the output to.
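To make the positive/negative distinction concrete, here is a minimal sketch. It is not the code of T0_S1_generate_training_data.py. It assumes that positive windows are centred on the two breakpoints of each SV and that negative windows are win-sized regions containing no breakpoint. The function names and the (start, end) tuple representation are hypothetical.

```python
def positive_windows(svs, win):
    """Return (window_start, window_end, label) for each SV breakpoint.

    svs: list of (sv_start, sv_end) positions on one chromosome.
    Assumption: each window is centred on its breakpoint.
    """
    half = win // 2
    windows = []
    for sv_start, sv_end in svs:
        windows.append((sv_start - half, sv_start + half, "SV_start"))
        windows.append((sv_end - half, sv_end + half, "SV_end"))
    return windows


def negative_windows(svs, win, chrom_len):
    """Return non-overlapping windows without any breakpoint, labelled noSV."""
    breakpoints = sorted(pos for sv in svs for pos in sv)
    windows = []
    for w_start in range(0, chrom_len - win + 1, win):
        w_end = w_start + win
        # Keep the window only if no breakpoint falls in [w_start, w_end).
        if not any(w_start <= bp < w_end for bp in breakpoints):
            windows.append((w_start, w_end, "noSV"))
    return windows


# Toy example: one SV on a 2000 bp chromosome, 200 bp windows.
svs = [(1000, 1500)]
pos = positive_windows(svs, win=200)
neg = negative_windows(svs, win=200, chrom_len=2000)
```

In this toy case `pos` holds one SV_start and one SV_end window, and `neg` holds every 200 bp bin except the two that contain a breakpoint.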

T0_S2_train.py Script used to train the model from the training data generated in step T0_S1. Input arguments are:

positive: NumPy .npz file of the positive training set (SVs)
negative: NumPy .npz file of the negative training set (noSV instances)
output: output file for the model in .hdf5 format

T0_S3_scan_chromosome.py Script used to scan a chromosome array and generate the predictions using the model created in step T0_S2. Input arguments are:

inputdir: the directory containing the chr_array folder with the bcolz chromosome arrays
chr: name of the chromosome to scan;
window: window size to split the chromosome array into, the same used in step T0_S1;
shift: offset applied to the position of the non-overlapping windows. With 0, the first window starts at the first chromosome position. The value must be in the range [0, window-1];
model: model file from step T0_S2 to use for generating the predictions;
output: Numpy .npz output file with the predictions:
- start: start positions predicted for the SV type
- end: end positions predicted for the SV type
- probs: posterior probabilities of the predictions, as produced by the softmax layer of the model
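The interplay of window and shift, and the way per-window probabilities turn into start/end positions, can be sketched as follows. This is an illustration only: the class order in `LABELS`, the choice to report the window midpoint, and both function names are assumptions, not taken from T0_S3_scan_chromosome.py.

```python
LABELS = ("SV_start", "SV_end", "noSV")  # assumed class order of the softmax output


def window_starts(chrom_len, window, shift=0):
    """Start positions of non-overlapping windows, offset by shift."""
    if not 0 <= shift < window:
        raise ValueError("shift must be in [0, window - 1]")
    # Only windows that fit entirely inside the chromosome are kept.
    return list(range(shift, chrom_len - window + 1, window))


def call_positions(starts, probs, window):
    """Assign each window its most probable class and collect positions.

    Assumption: a predicted breakpoint is reported at the window midpoint.
    """
    pred_start, pred_end = [], []
    for w_start, p in zip(starts, probs):
        label = LABELS[max(range(len(p)), key=p.__getitem__)]
        midpoint = w_start + window // 2
        if label == "SV_start":
            pred_start.append(midpoint)
        elif label == "SV_end":
            pred_end.append(midpoint)
    return pred_start, pred_end


# Toy example: 1000 bp chromosome, 200 bp windows shifted by 50 bp.
starts = window_starts(chrom_len=1000, window=200, shift=50)
probs = [
    [0.1, 0.1, 0.8],  # noSV
    [0.7, 0.2, 0.1],  # SV_start
    [0.2, 0.2, 0.6],  # noSV
    [0.1, 0.8, 0.1],  # SV_end
]
pred_start, pred_end = call_positions(starts, probs, window=200)
```

Running the scan with several shift values moves the window grid, so a breakpoint that sits on a window boundary for one shift falls inside a window for another.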

T0_S4_compare.py This script aggregates the predictions of step T0_S3 across all chromosomes and compares them with the truth set. Input arguments are:

truthset: SV truth set to use for the comparison;
chrlist: list of chromosomes to consider in the analysis;
win: window size used in steps T0_S1 and T0_S3;
inputdirlist: list of input directories, one per model, containing the predictions for each chromosome
output: CSV file with output stats
outputbed: BED file with the final output positions to consider for downstream analysis (targeted assembly)
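The aggregation and comparison can be pictured with the sketch below. It is a simplified stand-in for T0_S4_compare.py, not its implementation: taking the union of positions across models, counting a prediction as correct when it lies within win bp of a true breakpoint, and padding BED intervals by win/2 are all assumptions, and the function names are hypothetical.

```python
def merge_predictions(per_model):
    """Union of predicted positions across models, sorted and de-duplicated."""
    return sorted({pos for positions in per_model for pos in positions})


def compare(predicted, truth, win):
    """Return (precision, recall) with a tolerance of win bp around each truth position."""
    hits = [p for p in predicted if any(abs(p - t) <= win for t in truth)]
    found = [t for t in truth if any(abs(p - t) <= win for p in predicted)]
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(found) / len(truth) if truth else 0.0
    return precision, recall


def bed_lines(chrom, positions, win):
    """Tab-separated BED intervals of win bp centred on each position, clipped at 0."""
    half = win // 2
    return [f"{chrom}\t{max(0, p - half)}\t{p + half}" for p in positions]


# Toy example: two models, one chromosome, two true breakpoints.
merged = merge_predictions([[350, 5000], [350, 9100]])
precision, recall = compare(merged, truth=[400, 9000], win=200)
lines = bed_lines("chr1", merged, win=200)
```

Here two of the three merged positions fall near a true breakpoint and both true breakpoints are recovered, so precision is 2/3 and recall is 1.0.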
