Extended standalone #63

Open · wants to merge 99 commits into base: postprocess_dedup_filter

Changes from 13 commits (of 99 total)

Commits (99):
9e14ec5  added new scripts for extending standalone features (Mar 6, 2020)
8720927  fix for dbsnp (Mar 9, 2020)
046b123  Merge branch 'postprocess_dedup_filter' into extended_standalone (Mar 11, 2020)
9cb9960  fix_extract_ensemble (Mar 12, 2020)
19b8512  Merge branch 'postprocess_dedup_filter' into extended_standalone (Mar 12, 2020)
58ccf62  fix dirnames (Mar 12, 2020)
db2312e  Merge branch 'postprocess_dedup_filter' into extended_standalone (Mar 12, 2020)
cab5ff7  small fix (Mar 14, 2020)
c53f74e  Merge branch 'postprocess_dedup_filter' into extended_standalone (Mar 18, 2020)
efa4190  fix features (Mar 19, 2020)
0ed0e83  Merge branch 'postprocess_dedup_filter' into extended_standalone (Apr 10, 2020)
e4be780  fix ensemble (Apr 10, 2020)
8fc8583  backward compatiblity for call.py (Apr 22, 2020)
7514b31  Merge branch 'postprocess_dedup_filter' into extended_standalone (Apr 28, 2020)
8e8ec69  few fixes (Apr 29, 2020)
2a84509  fix format (Apr 29, 2020)
5581ba4  small fix (Apr 29, 2020)
115d814  fix for training loss (Apr 29, 2020)
c848028  Merge branch 'postprocess_dedup_filter' into extended_standalone (Apr 29, 2020)
a3c4f3e  fix for backward compatibility (Apr 29, 2020)
55a498e  Merge branch 'postprocess_dedup_filter' into extended_standalone (Apr 29, 2020)
f9ee725  fix train loss (Apr 30, 2020)
b09f5f6  Merge branch 'postprocess_dedup_filter' into extended_standalone (Apr 30, 2020)
26c4ca4  fix extend_features (May 3, 2020)
757dfab  improve efficiency in extend_features (May 3, 2020)
776ddce  fix for extend_features (May 4, 2020)
15ff4a8  added seq_complexity (May 5, 2020)
2c4f45d  fix ensemble (May 6, 2020)
e913e83  more efficient LC (May 6, 2020)
2be51cc  small fix (May 6, 2020)
e9b83da  filter duplicate by default (May 6, 2020)
a54be69  switched fisher test (May 7, 2020)
4f6bf57  fix seq_complexity (May 9, 2020)
dbe9c4a  fix num fields (May 9, 2020)
cb64681  zero anns columns added (May 10, 2020)
97d16f6  fix bug in read_info_extractor.py as in somaticseq (May 10, 2020)
018e87a  cluster variants for feature extraction (May 12, 2020)
0e394ff  small fix (May 12, 2020)
3c061b3  record aligned_pairs (May 12, 2020)
709b64b  small fix (May 12, 2020)
12167e1  more efficient read/ref pos match search (May 13, 2020)
ff00be3  input num_splits (May 13, 2020)
366964e  max_cluster size added (May 13, 2020)
839cc63  better memory management for feature extraction (May 13, 2020)
0501883  not to store aligned_pairs (May 13, 2020)
90a68da  small fix (May 13, 2020)
f83b6b5  enable custom header (May 15, 2020)
49c809f  fixed a bug (May 15, 2020)
2f9905b  fix bug in region splitting (May 15, 2020)
c4f24ac  small fix (May 15, 2020)
1ce3935  small_fix (May 15, 2020)
8b29757  small fix (May 15, 2020)
654b90a  Merge branch 'extended_standalone' into custom_header (May 15, 2020)
7b0cb75  small fix (May 15, 2020)
7706c3b  enable custom heading (May 16, 2020)
27f98c8  small fix (May 16, 2020)
0bc4655  small fix (May 17, 2020)
7deb7d6  small fix (May 23, 2020)
7897df8  small fix (May 28, 2020)
8ac67a1  fix ann (May 31, 2020)
607abcd  small fix (Jun 5, 2020)
867ba5f  added callers vcf to tsv (Jun 18, 2020)
635341e  merge regions for scanning (Jun 18, 2020)
a52d0c2  bug fixes for call/post (Jun 19, 2020)
2094b3e  fix_bugs (Jun 19, 2020)
30ac6cf  updated versions to 0.3.0 (Jun 19, 2020)
c2deecb  small fix (Jun 19, 2020)
bad8004  small fix (Jun 19, 2020)
97e4864  fix test (Jun 20, 2020)
fb6ea21  fix in resolve variants (Jun 24, 2020)
1bf40b4  ensemble with internal features (Jun 25, 2020)
2471f8e  fix resolve (Jun 25, 2020)
544e2f2  small fix (Jul 2, 2020)
96d2091  small fix (Jul 2, 2020)
fd2a2f7  repeat extension (Jul 19, 2020)
cca94c8  small fix (Jul 24, 2020)
b4d2bdf  improve cpu multi-thread call.py (Jul 29, 2020)
a359327  small fix (Jul 30, 2020)
f6d3174  updated dockerfile (Jul 30, 2020)
accb759  updated docker test (Jul 30, 2020)
9f5d876  fix build (Sep 19, 2020)
e500bf7  small_fix (Oct 6, 2020)
e48e15d  force cov_thr (Nov 7, 2020)
a96a236  fix max_cov (Nov 9, 2020)
d804742  fixed matrices gradual delete (Dec 6, 2020)
45fbcd6  fix generate_dataset (Dec 29, 2020)
cff3653  fix ensemble rounding (Jan 22, 2021)
603d582  reduce disc I/O while calling (Jan 23, 2021)
d8738f0  Updated README (Jan 26, 2021)
e8f9f04  fix README (Jan 26, 2021)
088c845  added uint16 as an option for input matrices (Mar 5, 2021)
47ac4da  fixed uint16 (Mar 5, 2021)
d948d08  added report_all_alleles (Mar 8, 2021)
798c880  added strict_labeling (Mar 8, 2021)
cdfc062  fixed strict_labeling (Mar 8, 2021)
74a27df  fixed strict_labeling (Mar 8, 2021)
a2eed5e  small fix (Mar 10, 2021)
399b15c  fixed strict_labeling (Mar 10, 2021)
13c45b7  small fix for strict_labeling (Mar 11, 2021)
12 changes: 10 additions & 2 deletions neusomatic/python/call.py
@@ -395,7 +395,6 @@ def write_vcf(vcf_records, output_vcf, chroms_order, pass_threshold, lowqual_thr

 def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads,
                     batch_size, max_load_candidates, pass_threshold, lowqual_threshold,
-                    ensemble,
                     use_cuda):
     logger = logging.getLogger(call_neusomatic.__name__)

@@ -412,7 +411,17 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads,

     vartype_classes = ['DEL', 'INS', 'NONE', 'SNP']
     data_transform = matrix_transform((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))

+    ensemble = False
+    with open(candidates_tsv[0]) as i_f:
+        for line in i_f:
+            x = line.strip().split()
+            if len(x) == 97:
+                ensemble = True
+                break
+
+    num_channels = 119 if ensemble else 26
+    logger.info("Number of channels: {}".format(num_channels))
     net = NeuSomaticNet(num_channels)
     if use_cuda:
         logger.info("GPU calling!")
@@ -607,7 +616,6 @@ def call_neusomatic(candidates_tsv, ref_file, out_dir, checkpoint, num_threads,
                         args.checkpoint,
                         args.num_threads, args.batch_size, args.max_load_candidates,
                         args.pass_threshold, args.lowqual_threshold,
-                        args.ensemble,
                         use_cuda)
     except Exception as e:
         logger.error(traceback.format_exc())
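The hunks above drop the explicit `ensemble` argument: `call.py` now infers ensemble mode from the width of the first candidates TSV (a 97-field line implies ensemble features, giving 119 input channels instead of 26). A minimal standalone sketch of that heuristic, assuming the diff's column counts; `detect_ensemble` and `num_input_channels` are illustrative helper names, not NeuSomatic API, and they take an iterable of lines rather than the open file handle used in `call.py`:

```python
def detect_ensemble(lines, ensemble_num_fields=97):
    """Return True if any line has the ensemble column count (97 in the diff)."""
    for line in lines:
        if len(line.strip().split()) == ensemble_num_fields:
            return True
    return False


def num_input_channels(ensemble):
    # Channel counts from the diff: 119 with ensemble features, 26 without.
    return 119 if ensemble else 26
```

This is what makes the change backward compatible: older standalone TSVs (26 fields) and ensemble TSVs are both accepted without a command-line flag.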
378 changes: 378 additions & 0 deletions neusomatic/python/extend_features.py

Large diffs are not rendered by default.

43 changes: 29 additions & 14 deletions neusomatic/python/generate_dataset.py
@@ -1346,7 +1346,7 @@ def find_records(input_record):
         return None


-def extract_ensemble(work, ensemble_tsv):
+def extract_ensemble(ensemble_tsv, ensemble_bed, is_extend):
     logger = logging.getLogger(extract_ensemble.__name__)
     ensemble_data = []
     ensemble_pos = []
@@ -1376,15 +1376,23 @@ def extract_ensemble(ensemble_tsv, ensemble_bed, is_extend):
         "tBAM_Other_Reads", "tBAM_Poor_Reads", "tBAM_REF_InDel_3bp", "tBAM_REF_InDel_2bp",
         "tBAM_REF_InDel_1bp", "tBAM_ALT_InDel_3bp", "tBAM_ALT_InDel_2bp", "tBAM_ALT_InDel_1bp",
         "InDel_Length"]
+    callers_features = ["if_MuTect", "if_VarScan2", "if_JointSNVMix2", "if_SomaticSniper", "if_VarDict", "MuSE_Tier",
+                        "if_LoFreq", "if_Scalpel", "if_Strelka", "if_TNscope", "Strelka_Score", "Strelka_QSS",
+                        "Strelka_TQSS", "VarScan2_Score", "SNVMix2_Score", "Sniper_Score", "VarDict_Score",
+                        "M2_NLOD", "M2_TLOD", "M2_STR", "M2_ECNT", "MSI", "MSILEN", "SHIFT3"]
+
+    n_vars = 0
     with open(ensemble_tsv) as s_f:
         for line in s_f:
             if not line.strip():
                 continue
             if line[0:5] == "CHROM":
                 header_pos = line.strip().split()[0:5]
-                header = line.strip().split()[5:105]
+                header_ = line.strip().split()[5:]
+                if is_extend:
+                    header_ += callers_features
                 header_en = list(filter(
-                    lambda x: x[1] in expected_features, enumerate(line.strip().split()[5:])))
+                    lambda x: x[1] in expected_features, enumerate(header_)))
                 header = list(map(lambda x: x[1], header_en))
                 if set(expected_features) - set(header):
                     logger.error("The following features are missing from ensemble file: {}".format(
@@ -1397,9 +1405,15 @@ def extract_ensemble(ensemble_tsv, ensemble_bed, is_extend):
             fields = line.strip().split()
             fields[2] = str(int(fields[1]) + len(fields[3]))
             ensemble_pos.append(fields[0:5])
+            features = fields[5:]
+            if is_extend:
+                features += ["0"] * len(callers_features)
             ensemble_data.append(list(map(lambda x: float(
-                x.replace("False", "0").replace("True", "1")), fields[5:])))
-    ensemble_data = np.array(ensemble_data)[:, order_header]
+                x.replace("False", "0").replace("True", "1")), features)))
+            n_vars += 1
+    if n_vars > 0:
+        ensemble_data = np.array(ensemble_data)[:, order_header]
+        header = np.array(header)[order_header].tolist()

     cov_features = list(map(lambda x: x[0], filter(lambda x: x[1] in [
         "Consistent_Mates", "Inconsistent_Mates", "N_DP",
@@ -1479,14 +1493,14 @@ def extract_ensemble(ensemble_tsv, ensemble_bed, is_extend):
     ]
     selected_features = sorted([i for f in min_max_features for i in f[0]])
     selected_features_tags = list(map(lambda x: header[x], selected_features))
-    for i_s, mn, mx in min_max_features:
-        s = ensemble_data[:, np.array(i_s)]
-        s = np.maximum(np.minimum(s, mx), mn)
-        s = (s - mn) / (mx - mn)
-        ensemble_data[:, np.array(i_s)] = s
-    ensemble_data = ensemble_data[:, selected_features]
-    ensemble_data = ensemble_data.tolist()
-    ensemble_bed = os.path.join(work, "ensemble.bed")
+    if n_vars > 0:
+        for i_s, mn, mx in min_max_features:
+            s = ensemble_data[:, np.array(i_s)]
+            s = np.maximum(np.minimum(s, mx), mn)
+            s = (s - mn) / (mx - mn)
+            ensemble_data[:, np.array(i_s)] = s
+        ensemble_data = ensemble_data[:, selected_features]
+        ensemble_data = ensemble_data.tolist()
     with open(ensemble_bed, "w")as f_:
         f_.write(
             "#" + "\t".join(map(str, header_pos + selected_features_tags)) + "\n")
@@ -1523,7 +1537,8 @@ def generate_dataset(work, truth_vcf_file, mode, tumor_pred_vcf_file, region_be

     split_batch_size = 10000
     if ensemble_tsv and not ensemble_bed:
-        ensemble_bed = extract_ensemble(work, ensemble_tsv)
+        ensemble_bed = os.path.join(work, "ensemble.bed")
+        extract_ensemble(ensemble_tsv, ensemble_bed, False)

     tmp_ = bedtools_intersect(
         tumor_pred_vcf_file, region_bed_file, args=" -u", run_logger=logger)
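The `min_max_features` block that this diff wraps in an `if n_vars > 0:` guard clamps each feature group to a fixed `[mn, mx]` range and rescales it to `[0, 1]` before the features are written to `ensemble.bed`. A standalone sketch of that normalization step, with the array slicing taken from the diff; `scale_feature_groups` is an illustrative name, not a NeuSomatic function, and the triples passed to it here are toy values:

```python
import numpy as np


def scale_feature_groups(ensemble_data, min_max_features):
    """Clamp each feature group to [mn, mx], then rescale it to [0, 1].

    min_max_features is a list of (column_indices, mn, mx) triples, as in
    extract_ensemble.
    """
    ensemble_data = np.array(ensemble_data, dtype=float)
    for i_s, mn, mx in min_max_features:
        s = ensemble_data[:, np.array(i_s)]
        s = np.maximum(np.minimum(s, mx), mn)  # clamp to [mn, mx]
        s = (s - mn) / (mx - mn)               # rescale to [0, 1]
        ensemble_data[:, np.array(i_s)] = s
    return ensemble_data
```

The added guard matters because `np.array([])[:, order_header]` would fail on an empty ensemble TSV; skipping the scaling when `n_vars == 0` still lets the header-only `ensemble.bed` be written.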