Step 3.3: Untargeted mode
Luca Santuari edited this page Mar 26, 2020 · 4 revisions
This approach does the following:
- generates multiple models, one per truth set for the training set, from single window data;
- scans each chromosome array (in 200 bp non-overlapping bins) to generate the predictions for SV_start, SV_end and noSV;
- combines the predictions and compares the results with the truth set of the test set.
The scripts to run are summarized in this SGE script:
T0_S1_generate_training_data.py
The first argument of this script is either:
- 'positive': generate the positive set, with labels SV_start and SV_end
- 'negative': generate the negative set, with label noSV
The other arguments are:
- chrlist: list of chromosomes to generate the positive/negative set for;
- win: window size to use for the single windows;
- truthset: SV callset to generate the training data from, either a truth set or an SV callset from one of the SV callers in the sv-callers workflow;
- inputdir: the directory containing the chr_array folder with the bcolz chromosome arrays;
- output: NumPy .npz file to write the output to.
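To illustrate the kind of .npz output described above, here is a minimal sketch of writing and reading back a small training set with NumPy. The array keys `X` and `y`, the three-channel window layout, and the helper name `save_training_set` are assumptions made for illustration; the keys actually used by T0_S1_generate_training_data.py may differ.

```python
import io

import numpy as np


def save_training_set(file, windows, labels):
    """Write windows and their labels to a compressed .npz archive.

    `file` can be a path or a file-like object. The keys "X" and "y"
    are hypothetical names chosen for this sketch.
    """
    windows = np.asarray(windows, dtype=np.float32)
    labels = np.asarray(labels)
    if len(windows) != len(labels):
        raise ValueError("windows and labels must have the same length")
    np.savez_compressed(file, X=windows, y=labels)


# Toy positive set: 4 windows of win=200 positions and 3 made-up channels.
rng = np.random.default_rng(0)
win = 200
X_pos = rng.random((4, win, 3))
y_pos = np.array(["SV_start", "SV_end", "SV_start", "SV_end"])

# An in-memory buffer stands in for the output file on disk.
buffer = io.BytesIO()
save_training_set(buffer, X_pos, y_pos)

# Read the archive back to check the round trip.
buffer.seek(0)
with np.load(buffer) as data:
    X_loaded, y_loaded = data["X"], data["y"]
```

In practice a file path ending in .npz would replace the in-memory buffer.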
T0_S2_train.py
Script used to train the model from the training data generated in step T0_S1. Input arguments are:
- positive: NumPy .npz file of the positive training set (SVs);
- negative: NumPy .npz file of the negative training set (noSV instances);
- output: output file for the model in .hdf5 format.
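Before training, the positive and negative sets have to be merged into a single labeled data set. The sketch below shows one possible way to do this with NumPy. The class order in `CLASSES`, the one-hot encoding, and the function name `build_training_arrays` are assumptions for illustration and are not taken from T0_S2_train.py.

```python
import numpy as np

# Hypothetical class order; the real script may encode labels differently.
CLASSES = ["SV_start", "SV_end", "noSV"]


def build_training_arrays(X_pos, y_pos, X_neg, seed=0):
    """Merge positive and negative windows into shuffled (X, one-hot y)."""
    class_index = {name: i for i, name in enumerate(CLASSES)}
    y_pos_idx = np.array([class_index[label] for label in y_pos])
    # Every negative window carries the noSV label.
    y_neg_idx = np.full(len(X_neg), class_index["noSV"])

    X = np.concatenate([X_pos, X_neg], axis=0)
    y = np.concatenate([y_pos_idx, y_neg_idx])

    # Shuffle so that positives and negatives are interleaved.
    order = np.random.default_rng(seed).permutation(len(X))
    X, y = X[order], y[order]

    # One-hot targets, as typically paired with a softmax output layer.
    y_onehot = np.eye(len(CLASSES), dtype=np.float32)[y]
    return X, y_onehot


# Toy data: 2 positive windows (all zeros) and 3 negative windows (all ones).
X_pos = np.zeros((2, 200, 3), dtype=np.float32)
y_pos = ["SV_start", "SV_end"]
X_neg = np.ones((3, 200, 3), dtype=np.float32)

X_train, y_train = build_training_arrays(X_pos, y_pos, X_neg)
```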
T0_S3_scan_chromosome.py
Script used to scan a chromosome array and generate the predictions using the model created in step T0_S2. Input arguments are:
- inputdir: the directory containing the chr_array folder with the bcolz chromosome arrays;
- chr: name of the chromosome to scan;
- window: window size to split the chromosome array into, the same as used in step T0_S1;
- shift: offset of the first non-overlapping window. 0 means that the windows start at the first chromosome position. It must be in the range [0, window-1];
- model: model file from step T0_S2 to use for generating the predictions;
- output: NumPy .npz output file with the predictions:
  - start: start positions predicted for the SV type
  - end: end positions predicted for the SV type
  - probs: posterior probabilities for the predictions, generated by the Softmax layer of the model
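The effect of the window and shift arguments can be sketched as follows. This is an illustration under stated assumptions, not the code of T0_S3_scan_chromosome.py: positions are taken to be 0-based, a trailing partial window is dropped, and the class order used to read the softmax probabilities is hypothetical.

```python
import numpy as np


def window_starts(chrom_len, window, shift=0):
    """Start positions (0-based) of non-overlapping windows on a chromosome."""
    if not 0 <= shift < window:
        raise ValueError("shift must be in the range [0, window - 1]")
    # Keep only windows that fit entirely inside the chromosome.
    return np.arange(shift, chrom_len - window + 1, window)


starts = window_starts(chrom_len=1000, window=200, shift=0)
shifted = window_starts(chrom_len=1000, window=200, shift=100)

# Turning per-window softmax probabilities into labels.
# The class order below is an assumption made for this example.
CLASSES = ["SV_start", "SV_end", "noSV"]
probs = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
])
predicted = [CLASSES[i] for i in probs.argmax(axis=1)]
```

With a 1000-position toy chromosome, a shift of 0 gives windows starting at 0, 200, 400, 600 and 800, while a shift of 100 gives 100, 300, 500 and 700.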
T0_S4_compare.py
This script aggregates the predictions of step T0_S3 across all chromosomes and compares them with the truth set. Input arguments are:
- truthset: SV truth set to use for the comparison;
- chrlist: list of chromosomes to consider in the analysis;
- win: window size used in steps T0_S1 and T0_S3;
- inputdirlist: list of input directories, one per model, containing the predictions for each chromosome;
- output: CSV file with output stats;
- outputbed: BED file with the final output positions to consider for downstream analysis (targeted assembly).
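As a rough illustration of the last two outputs, the sketch below writes predicted windows as BED intervals and counts how many truth-set positions fall inside them. The helper names `write_bed` and `count_hits`, the use of 0-based window starts, and the simple overlap criterion are assumptions; the matching logic and statistics in T0_S4_compare.py may be different.

```python
import io


def write_bed(handle, chrom, starts, window):
    """Write one BED line per predicted window.

    BED intervals are 0-based and half-open, so a window that starts at
    0-based position `s` is written as (s, s + window).
    """
    for start in sorted(starts):
        handle.write(f"{chrom}\t{start}\t{start + window}\n")


def count_hits(truth_positions, starts, window):
    """Count truth positions (0-based) covered by any predicted window."""
    return sum(
        any(s <= pos < s + window for s in starts)
        for pos in truth_positions
    )


# Toy example on a single chromosome.
predicted_starts = [800, 200]
truth = [250, 399, 400, 950]

out = io.StringIO()
write_bed(out, "chr1", predicted_starts, window=200)
bed_text = out.getvalue()

hits = count_hits(truth, predicted_starts, window=200)
```

Here three of the four truth positions (250, 399 and 950) fall inside a predicted window, while position 400 lies just past the end of the half-open interval 200-400.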