
[E::sam_parse1] CIGAR and query sequence are of different length #245

Closed
ksahlin opened this issue Feb 24, 2023 · 37 comments


ksahlin commented Feb 24, 2023

Hi @marcelm,

I am currently benchmarking v0.7.1 vs v0.8.0 vs commit eb7ef72 for SNP/indel calling and get the error below:

rule make_sorted_bam:
    input: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam
    output: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.bam
    jobid: 0
    wildcards: tool=strobealign_eb7ef72, dataset=BIO250

[E::sam_parse1] CIGAR and query sequence are of different length
samtools view: error reading file "/proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam": Input/output error
samtools view: error closing "/proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam": -5
[bam_sort_core] merging from 0 files and 16 in-memory blocks...
[Fri Feb 24 08:27:15 2023]
Error in rule make_sorted_bam:
    jobid: 0
    output: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.bam

RuleException:
CalledProcessError in line 356 of /domus/h1/kris/source/alignment_evaluation/evaluation_VC/Snakefile:
Command 'set -o pipefail;  samtools view -@ 16 -u /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam | samtools sort -@ 16 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.bam' returned non-zero exit status 1.
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2318, in run_wrapper
  File "/domus/h1/kris/source/alignment_evaluation/evaluation_VC/Snakefile", line 356, in __rule_make_sorted_bam
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 560, in _callback
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/concurrent/futures/thread.py", line 52, in run
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 546, in cached_or_run
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
Removing output files of failed job make_sorted_bam since they might be corrupted:
/proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.bam

Here is the run command

strobealign-eb7ef72 -t 16 -r 250 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/BIO250/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/BIO250/reads_R2.fastq

Full output of the strobealign job is below. Also note the warnings (which I have not seen before).

This is strobealign v0.8.0-107-geb7ef72
Time reading reference: 7.53 s
Reference size: 3099.92 Mbp (195 contigs; largest: 248.96 Mbp)
Indexing ...
  Time generating seeds: 186.89 s
  Time estimating number of unique hashes: 10.14 s
  Time sorting non-unique seeds: 2.37 s
  Time generating hash table index: 2.32 s
Total time indexing: 201.78 s
Running in paired-end mode
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
[... the warning above repeated many more times ...]
Done!
Total mapping sites tried: 373090331
Total calls to ssw: 40200451
Calls to ksw (rescue mode): 40200451
Did not fit strobe start site: 8738629
Tried rescue: 15625370
Total time mapping: 2146.95 s.
Total time reading read-file(s): 16.42 s.
Total time creating strobemers: 171.01 s.
Total time finding NAMs (non-rescue mode): 343.23 s.
Total time finding NAMs (rescue mode): 165.72 s.
Total time sorting NAMs (candidate sites): 536.19 s.
Total time base level alignment (ssw): 1389.16 s.
Total time writing alignment to files: 0.00 s.
	Command being timed: "strobealign-eb7ef72 -t 16 -r 250 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/BIO250/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/BIO250/reads_R2.fastq"
	User time (seconds): 33420.12
	System time (seconds): 1108.16
	Percent of CPU this job got: 1463%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 39:19.30
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 22302784
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 5
	Minor (reclaiming a frame) page faults: 539861156
	Voluntary context switches: 20575
	Involuntary context switches: 60662
	Swaps: 0
	File system inputs: 3451864
	File system outputs: 246435376
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
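
As a side note, one way to locate the offending record(s) is to compare the query-consuming CIGAR length against the SEQ length for every SAM line. A rough awk sketch (run on the SAM file from the error message above):

awk -F'\t' '!/^@/ && $6 != "*" && $10 != "*" {
    n = 0; cig = $6
    while (match(cig, /[0-9]+[MIDNSHP=X]/)) {
        len = substr(cig, RSTART, RLENGTH - 1) + 0      # length of this CIGAR operation
        op  = substr(cig, RSTART + RLENGTH - 1, 1)      # operation character
        if (op ~ /[MIS=X]/) n += len                    # operations that consume the query
        cig = substr(cig, RSTART + RLENGTH)
    }
    if (n != length($10)) print NR, $1, $6              # line number, read name, CIGAR
}' /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam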

ksahlin commented Feb 24, 2023

I should also mention that the speed difference between the three versions on this 2*250 bp dataset is marginal (pasted below).

I am starting to believe that the v0.7.1 time measurement of the NAM sorting is wrong.

Commit eb7ef72 (built with CMake following the newest instructions, i.e. cmake -B build -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native")

Total mapping sites tried: 373090331
Total calls to ssw: 40200451
Calls to ksw (rescue mode): 40200451
Did not fit strobe start site: 8738629
Tried rescue: 15625370
Total time mapping: 2146.95 s.
Total time reading read-file(s): 16.42 s.
Total time creating strobemers: 171.01 s.
Total time finding NAMs (non-rescue mode): 343.23 s.
Total time finding NAMs (rescue mode): 165.72 s.
Total time sorting NAMs (candidate sites): 536.19 s.
Total time base level alignment (ssw): 1389.16 s.
Total time writing alignment to files: 0.00 s.

v0.8.0 (cmake)

Total mapping sites tried: 373090489
Total calls to ssw: 115247221
Calls to ksw (rescue mode): 40200726
Did not fit strobe start site: 8738600
Tried rescue: 15625370
Total time mapping: 2047.11 s.
Total time reading read-file(s): 16.61 s.
Total time creating strobemers: 181.31 s.
Total time finding NAMs (non-rescue mode): 327.49 s.
Total time finding NAMs (rescue mode): 159.41 s.
Total time sorting NAMs (candidate sites): 512.79 s.
Total time base level alignment (ssw): 1304.86 s.
Total time writing alignment to files: 0.00 s.

v0.7.1

Using rescue cutoff: 84
Running PE mode
Done!
Total mapping sites tried: 374508959
Total calls to ssw: 115870869
Calls to ksw (rescue mode): 40069880
Did not fit strobe start site: 8679805
Tried rescue: 15674982
Total time mapping: 2302.98 s.
Total time reading read-file(s): 20.3734 s.
Total time creating strobemers: 208.319 s.
Total time finding NAMs (non-rescue mode): 428.788 s.
Total time finding NAMs (rescue mode): 186.507 s.
Total time sorting NAMs (candidate sites): 26.6265 s.
Total time reverse compl seq: 0 s.
Total time base level alignment (ssw): 1365.13 s.
Total time writing alignment to files: 0 s.


ksahlin commented Feb 24, 2023

Perhaps related: I also get a segmentation fault in bcftools mpileup for the SIM250 dataset (simulated 2*250 bp reads):

rule bcftools_mpileup:
    input: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam, /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    output: /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz
    jobid: 0
    wildcards: tool=strobealign_eb7ef72, dataset=SIM250

[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
/usr/bin/bash: line 1:  8315 Segmentation fault      (core dumped) bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz
[Fri Feb 24 09:18:51 2023]
Error in rule bcftools_mpileup:
    jobid: 0
    output: /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz

RuleException:
CalledProcessError in line 367 of /domus/h1/kris/source/alignment_evaluation/evaluation_VC/Snakefile:
Command 'set -o pipefail;  bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz' returned non-zero exit status 139.

command:

strobealign-eb7ef72 -t 16 -r 250 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/SIM250/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/SIM250/reads_R2.fastq

Interestingly, on the simulated 2*250 bp reads, the difference in alignment time is a lot larger, almost a 1.4x speedup:

v0.7.1

Total mapping sites tried: 462403523
Total calls to ssw: 49830769
Calls to ksw (rescue mode): 14449902
Did not fit strobe start site: 807582
Tried rescue: 18118441
Total time mapping: 2366.61 s.
Total time reading read-file(s): 31.5706 s.
Total time creating strobemers: 346.908 s.
Total time finding NAMs (non-rescue mode): 723.833 s.
Total time finding NAMs (rescue mode): 216.319 s.
Total time sorting NAMs (candidate sites): 40.3552 s.
Total time reverse compl seq: 0 s.
Total time base level alignment (ssw): 907.439 s.
Total time writing alignment to files: 0 s.

Commit eb7ef72

Total mapping sites tried: 459271867
Total calls to ssw: 14612692
Calls to ksw (rescue mode): 14612692
Did not fit strobe start site: 707189
Tried rescue: 18086605
Total time mapping: 1793.47 s.
Total time reading read-file(s): 28.40 s.
Total time creating strobemers: 289.29 s.
Total time finding NAMs (non-rescue mode): 547.71 s.
Total time finding NAMs (rescue mode): 192.14 s.
Total time sorting NAMs (candidate sites): 779.98 s.
Total time base level alignment (ssw): 643.37 s.
Total time writing alignment to files: 0.00 s.


ksahlin commented Feb 24, 2023

Final note: the CIGAR error is also present in the BIO150 dataset. Run as:

strobealign-eb7ef72 -t 16 -r 150 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO150.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/BIO150/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/BIO150/reads_R2.fastq


marcelm commented Feb 24, 2023

I was able to reproduce this. Both the warning message ("The alignment path of one pair ...") and the inconsistent CIGAR come from some changes made in SSW. I updated our copy of SSW as part of PR #242. Let me see what can be done about this. Perhaps I should just go back to the older SSW version.


ksahlin commented Feb 24, 2023

Okay great. I guess SSW is still used in the rescue alignment step. Didn't think of that.


marcelm commented Feb 24, 2023

Yes, it’s still used there. I can probably replace it with ksw2 as well.


marcelm commented Feb 24, 2023

Should be fixed now, see my comment in #242.


ksahlin commented Feb 24, 2023

Okay I will start a benchmark of commit edeb390


ksahlin commented Feb 25, 2023

The make_sorted_bam rule completes with the fix in edeb390 for all four datasets (SIM150, SIM250, BIO150, BIO250), but the bcftools_mpileup step still gets a segmentation fault in all four instances.

EXAMPLE BCFTOOLS COMMAND:

bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/SIM150.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.vcf.gz

The four strobealign bam files are found at /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/*.bam

ERROR:

[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
/usr/bin/bash: line 1: 11999 Segmentation fault      (core dumped) bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/SIM150.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.vcf.gz
[Sat Feb 25 10:13:13 2023]

The error happens after 30 s, 1 m 20 s, 1 m 30 s, and 14 m for the four files, if that is of any help.


marcelm commented Feb 27, 2023

I want to make sure I test this with the same bcftools version as you. Did you just use module load bcftools to make it available?


marcelm commented Feb 27, 2023

I was able to reproduce the problem with bcftools 1.14, 1.15 and 1.16.


marcelm commented Feb 27, 2023

Ah, this is samtools/bcftools#1805, fixed in bcftools 1.17.

I was able to reproduce the problem with a SAM file that looks (shortened) like this:

simulated.265988540	83	chr1	7114818	60	120=1X29=49D	...
simulated.201489102	163	chr1	7114821	60	141=49D9S	...
simulated.294364461	147	chr1	7114822	4	146=49D4=	...

The description in the issue isn’t clear about what the actual problem is, so I can only guess that the long deletion has something to do with it.

Having a CIGAR end with deletions doesn’t make much sense so maybe we should fix that on our side as well. (It does not appear to be the direct cause, though, since there are more alignments with deletions at the end preceding the above records.) It is a result of the NAM being longer on the reference than it should be. Perhaps this is another of these "edge effects", but for the 3' end this time?
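
A quick way to see how common these are (a rough sketch; aln.sam is a placeholder for the affected SAM file) is to count alignments whose CIGAR ends in a deletion:

awk -F'\t' '!/^@/ && $6 ~ /[0-9]+D$/ { n++ } END { print n+0, "alignments have a CIGAR ending in a deletion" }' aln.sam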


ksahlin commented Feb 27, 2023

Perhaps this is another of these "edge effects", but for the 3' end this time?

Yes, probably. At least it matches my intuition. So we have the STR edge effects at both the 5' and 3' ends. Since alignments seem to always be left-shifted, this results in an internal deletion in the 5' case. I am wondering why it does not become a soft clip at the 3' end, and I am in general confused about the CIGAR string 141=49D9S.


ksahlin commented Mar 1, 2023

I updated to use bcftools v1.17.

  • SIM150 failed at a bcftools mpileup step, but it could be an Uppmax-related I/O error; I will rerun that experiment.
  • SIM250 is still running.

Below are the results I currently have comparing v0.7.1, v0.8.0, and commit edeb390 (called main in the table below). Sorry for the current formatting.

Main takeaways:

  • v0.7.1 and v0.8.0 have identical results (to within 0.1%) in all but two datasets:
    • BIO150 for indels (same recall and precision, but the F-score differs by <0.1%, probably tipping the rounding decimal)
    • BIO250 for recall and precision in SNV calling - v0.8.0 has higher recall and lower precision.
  • The new approach (denoted main) has slightly lower SNV recall and similar or slightly lower precision. However, main really loses in indel prediction, with both lower precision and recall (on the biological data at least).

I want to wait until the simulated data is in hand before drawing strong conclusions, since the biological indel calling analysis may not be 100% accurate. I simply calculated the intersection (bcftools isec) of indels between the gold standard (GIAB) and the predictions; the SIM datasets should be more controlled. However, it would agree with the intuition that a full semi-global alignment can span some longer indels that the new approach cannot. Let's see when the results from simulated data are ready.

dataset,tool,variant,recall,precision,f_score
BIO150,strobealign_main,SNV,97.8,85.9,91.4
BIO150,strobealign_main,INDEL,7.8,4.7,5.9
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v071,INDEL,8.0,5.5,6.6
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,INDEL,8.0,5.5,6.5
-----------------------------------------------
BIO250,strobealign_main,SNV,95.8,81.8,88.3
BIO250,strobealign_main,INDEL,7.6,5.4,6.3
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v071,INDEL,7.8,6.2,6.9
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v080_cmake,INDEL,7.8,6.2,6.9
-----------------------------------------------
SIM250,strobealign_main,SNV,98.5,99.5,99.0
SIM250,strobealign_main,INDEL,67.5,59.9,63.5
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v071,INDEL,50.6,53.9,52.2
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,INDEL,50.6,53.9,52.2
-----------------------------------------------
SIM150,strobealign_main,SNV,97.3,99.5,98.4
SIM150,strobealign_main,INDEL,64.3,53.6,58.5
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v071,INDEL,49.7,53.7,51.6
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,INDEL,49.7,53.7,51.6

Edit: strobealign_main results were added for SIM150 and SIM250 datasets


ksahlin commented Mar 1, 2023

Well, lo and behold: I updated the table above with the SNV and indel calling on the SIM250 data. The INDEL calling results with main are so much better than with previous versions of strobealign, as well as other aligners (see the figure from the paper below; >10%!), that I am skeptical. I did not change anything in the analysis (same datasets), but the previous results were run with bcftools 1.15 while main now uses v1.17 due to the bug described above. I looked at the changelogs of v1.16 and v1.17 and nothing suggests a change in the calling algorithm of bcftools call.

I do not have any insights at the moment. As you have been working on the new extension-level alignments, @marcelm, do you have an idea of what could explain these results? Would better soft-clipping decisions improve the results this drastically? Variants in STR regions? (The simulations and bio data use hg38, not chm13.)

Screenshot 2023-03-01 at 20 11 37


ksahlin commented Mar 1, 2023

As for runtime, here are some metrics. The speed gain on simulated data is substantial(!) for edeb390, but it doesn't seem to translate as well to the biological data.

tool,dataset,read_length,time,memory
strobealign_v071,BIO150,150,4486.368,32.366352
strobealign_v080_cmake,BIO150,150,3963.96,22.373084
strobealign_main,BIO150,150,4402.06,22.373108
------------------------------------
strobealign_v071,BIO250,250,2307.502,33.476796
strobealign_v080_cmake,BIO250,250,2057.87,22.302712
strobealign_main,BIO250,250,2130.21,22.302728
------------------------------------
strobealign_v071,SIM150,150,2990.5989999999997,32.1321
strobealign_v080_cmake,SIM150,150,2514.6299999999997,22.373012
strobealign_main,SIM150,150,2452.14,22.373144
------------------------------------
strobealign_v071,SIM250,250,2371.205,33.313704
strobealign_v080_cmake,SIM250,250,1984.1,22.30268
strobealign_main,SIM250,250,1778.55,22.30276


ksahlin commented Mar 2, 2023

I also added SIM150 results to the table above. The recall on this dataset has gone up by nearly 15 percentage points.

(The previous mpileup error I reported was likely Uppmax disk-I/O related, as a restart finished without errors.)


jkbonfield commented Mar 2, 2023

I did not change anything in the analysis (same datasets) but previous results were run with bcftools 1.15 and main is now using v1.17 due to above described bug. I looked at the changelog of v1.16 and v1.17 and nothing suggests a change in calling algorithm in bcftools call.

"bcftools call" rarely changes. It's mostly mpileup where improvements occur, which adds the tags used by call.

It's worth rerunning the older aligner results with bcftools 1.17 so you have a safer comparison. Bcftools is regularly getting updates, although the most major revamp to indel calling was done by myself in 1.13 (see graphs at samtools/bcftools#1474). Since then there have been fixes to some of the FORMAT fields, so potentially you may get changes if you're applying post-call filtering.

We also reversed a slowdown that had accidentally incorporated some experimental code with a major performance cost, but that only affects the speed of running bcftools rather than the results, I think. It's also possible some other changes had the unintended (but apparently good!) effect of improving calling. [The primary author doesn't usually run large truth-set comparisons to assess the impact of code changes, so wouldn't spot such things.]

Edit: there's also the open PR samtools/bcftools#1679, which further improves indel calling considerably (10-30% fewer indel FPs and 5-10% fewer indel FNs), but I think it has been rejected in favour of a major rewrite. This ongoing rewrite is visible in 1.17 as bcftools mpileup --indels-2.0, but due to the lack of the heuristics added in PR 1679 it is currently poorer than both develop and that PR for false positives (although it is a little bit better at recall). Personally I'd prefer both: a patch to make the current algorithm better while the new one is being developed, but that's not my call.


ksahlin commented Mar 2, 2023

It's worth rerunning the older aligner results with bcftools 1.17 so you have a safer comparison. Bcftools is regularly getting updates, although the most major revamp to indel calling was done by myself in 1.13 (see graphs at samtools/bcftools#1474). Since then there have been fixes to some of the FORMAT fields, so potentially you may get changes if you're applying post-call filtering.

Thank you for the feedback, @jkbonfield! I am currently rerunning all the bcftools steps (mpileup, call, sort, index, isec) on all strobealign versions as well as BWA MEM. Will update once I have those results.


ksahlin commented Mar 2, 2023

So I reran bwa mem as well as strobealign v0.7.1, 'v0.8.0' (which is really commit 8e53e41) and 'main' (which is really commit edeb390). This rerun triggers all the bcftools steps using v1.17. Results below.

In summary: commit edeb390 is so much better at indel recall (15 and 17 percentage points for 150PE and 250PE) than previous versions that it is nearly unbelievable. I verified that all files are updated by looking at dates etc. I hope I have not made a silly mistake elsewhere.

I tried comparing CIGAR strings but realised that my pipeline discards the SAM files once the sorted BAMs are created, so that prevents a quick comparison of differing lines on my end. @marcelm, do you have any idea why we see such huge improvements in indel recall, or if not, could you look a bit into this? (One thing I saw changed is that we trim away /1 and /2 in main. Could this mean that bcftools did not treat reads as PE reads in previous versions, which maybe increases calling power?)
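
If needed, the CIGARs could still be pulled from the sorted BAMs; a rough comparison sketch (old.bam and new.bam are placeholders for, e.g., the v0.8.0 and main BAMs of the same dataset):

samtools view -F 0x900 old.bam | cut -f1,2,6 | sort > old.cigars   # read name, flag, CIGAR
samtools view -F 0x900 new.bam | cut -f1,2,6 | sort > new.cigars
diff old.cigars new.cigars | head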

dataset,tool,variant,recall,precision,f_score
SIM150,strobealign_main,SNV,97.3,99.5,98.4
SIM150,strobealign_main,INDEL,64.3,53.6,58.5
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v071,INDEL,49.6,53.6,51.5
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,INDEL,49.6,53.6,51.5
SIM150,bwa_mem,SNV,97.9,99.6,98.7
SIM150,bwa_mem,INDEL,45.9,53.6,49.5
------------------------------------------------------
SIM250,strobealign_main,SNV,98.5,99.5,99.0
SIM250,strobealign_main,INDEL,67.5,59.9,63.5
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,INDEL,50.5,53.9,52.1
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v071,INDEL,50.5,53.9,52.1
SIM250,bwa_mem,SNV,98.9,99.5,99.2
SIM250,bwa_mem,INDEL,50.4,53.7,52.0


marcelm commented Mar 2, 2023

I don’t know why this would differ so much either. As a first thing, I’d try to open the BAM files in IGV and look at some regions with differently called indels. I can do this next week.


ksahlin commented Mar 2, 2023

One idea: the TP and FP calls are simply derived as bcftools isec --threads 16 --nfiles 2 -O u {input.vcf_SNV_true} {variants_SNV_sorted_vcf_gz}. This is relatively robust for SNVs (a SNP at the same position gives an overlap).

However, if for some reason there are many redundant (overlapping) indel calls overlapping a true indel, I think a TP will be counted once for each overlapping indel call by isec. Maybe main has a problem with this? (A rough check for such redundant calls is sketched after the table below.)

One thing that led me to this suspicion is that the BIO results are as follows, suggesting lower precision in indel calls. Although for the BIO data the ground truth is even more difficult to decipher, because some larger indels present in the ground truth should probably be filtered out (they may create spurious TPs).

BIO150,strobealign_main,SNV,97.8,85.9,91.4
BIO150,strobealign_main,INDEL,7.8,4.7,5.9
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v071,INDEL,8.0,5.8,6.7
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,INDEL,8.0,5.7,6.7
---------------------------------------------
BIO250,strobealign_main,SNV,95.8,81.8,88.3
BIO250,strobealign_main,INDEL,7.6,5.4,6.3
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v071,INDEL,7.8,6.4,7.0
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v080_cmake,INDEL,7.8,6.3,7.0
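
A rough sketch of the redundancy check mentioned above (calls.vcf.gz is a placeholder for one of the indel call sets):

bcftools query -f '%CHROM\t%POS\t%TYPE\n' calls.vcf.gz \
  | awk -F'\t' '$3 == "INDEL" { if ($1 == chr && $2 - pos < 10) n++; chr = $1; pos = $2 }
                END { print n+0, "indel calls start within 10 bp of the previous indel call" }'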


ksahlin commented Mar 3, 2023

I took a look at the true positive predictions for indels (produced with bcftools isec). From my quick look, nothing suggests that there is anything suspicious going on such as redundant predictions, which was my previous hypothesis.

Below I have pasted the first 50 calls classified as true positives by bcftools isec for main, bwa mem, and v0.8.0 on the SIM150 reads. They are sorted w.r.t. coordinates on GRCh38.

Rows with stars indicate calls that are not found by bwa mem or strobealign v0.8.0. All of these calls perfectly match a simulated variant in the truth set (pasted furthest down).

(Also, I added results for bwa mem to the table above. No significant change with bcftools v1.17.)

STROBEALIGN MAIN

(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.variants.INDEL.ovl/sites.txt 
chr1	52171	TGATCTAA	T	11
chr1	92584	CCAGTGGCACCCTGACTGGCTTGGT	C	11
chr1	125918	ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT	A	11
chr1	283067	TTAATATATAACTGG	T	11
chr1	286391	TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA	T	11
* chr1	535531	A	AGTGAA	11
chr1	619526	T	TCCTGAGCTGGATCAGGGACATTCCTA	11
chr1	629866	TCAACA	T	11
chr1	760358	TCTTCTAATGTAATAAAATTTGCTTCGGC	T	11
chr1	770917	A	AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT	11
* chr1	786095	C	CTGCTTGAACCTCAGCCGCCACCCCGAACAATTTACATC	11
chr1	798649	TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA	T	11
chr1	819783	G	GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC	11
* chr1	823128	ACTGTCGA	A	11
chr1	824236	TGCGTACATATATTTAA	T	11
chr1	836291	AGTCCCTGGCTCCATGTGG	A	11
chr1	855681	ATGGCCAATCCGTCAG	A	11
chr1	875485	G	GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA	11
chr1	879789	T	TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG	11
chr1	888715	A	ACTTGACGACACCGCATATCACTTCCC	11
chr1	898531	T	TATCACATGTGCTTAGCACATCGAGCG	11
chr1	908621	G	GGTTAACATTTGTGGTCTCGCCTGTGA	11
chr1	913271	A	AAGTGAACTG	11
chr1	922037	CCAGAAACCCAGG	C	11
chr1	972607	A	ATTAACCATACGATGGCT	11
chr1	974202	TGGGACTGGGGAGGGAGAGCAGGGGG	T	11
* chr1	982067	GGGTCCCGGCGGCCGAGCTGGGCGTCTGAGCTGCCCGTCCTGGGTCGG	G	11
chr1	982701	A	AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG	11
chr1	996668	CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA	C	11
chr1	1043129	CTTCCTCCACCATCCCCTCCCTGGA	C	11
chr1	1044304	GCGGCCGCTCACACTGACACCACCC	G	11
chr1	1064273	AGCTT	A	11
chr1	1069991	TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC	T	11
chr1	1084687	A	AATGAG	11
* chr1	1109159	C	CCC	11
chr1	1115974	GC	G	11
chr1	1118050	TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG	T	11
chr1	1124569	C	CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA	11
* chr1	1128083	CTGACTGCCAAGGGCAGCGACTGCAAGAC	C	11
chr1	1135476	A	AGATCAT	11
chr1	1149097	TTTCAGGATAAGA	T	11
* chr1	1156004	CCCACAC	C	11
chr1	1157057	G	GACCCAGCAAGCTACT	11
chr1	1168799	G	GCGTAAGAGCTGCGGCCTGTTAAGCGCGCCCGGTGTACTAGACGCTCCCC	11
* chr1	1189269	GGAAGTTCCCTTCTGTTCCTAATTGG	G	11
chr1	1198844	CACGGCCAAGTCGCCGG	C	11
chr1	1212269	AAGGACGCTGCCCAGCCCCACGTGCC	A	11
chr1	1270788	CCTGCCAGCCCCTTGACA	C	11
chr1	1274202	TAGGGGAGGC	T	11
chr1	1286130	CTGTCCCTGG	C	11

BWA MEM

(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/bwa_mem/SIM150.variants.INDEL.ovl/sites.txt 
chr1	49627	G	GTTCTAGTTGGC	11
chr1	52171	TGATCTAA	T	11
chr1	92584	CCAGTGGCACCCTGACTGGCTTGGT	C	11
chr1	99579	CGTTTGTTAAAAGATATTATTTTGCTTTACACTTTTTCTCTCAGAAATAAG	C	11
chr1	125918	ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT	A	11
chr1	283067	TTAATATATAACTGG	T	11
chr1	286391	TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA	T	11
chr1	370401	TCTCCATGTCCTGCTGGTTAAG	T	11
chr1	615614	A	ATACATCGTAACT	11
chr1	619526	T	TCCTGAGCTGGATCAGGGACATTCCTA	11
chr1	629866	TCAACA	T	11
chr1	711525	ACAGAAGAAAATGTCTCCGGCAATAAATCAC	A	11
chr1	753148	A	AATTGGGAAACCAAT	11
chr1	760358	TCTTCTAATGTAATAAAATTTGCTTCGGC	T	11
chr1	770917	A	AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT	11
chr1	798649	TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA	T	11
chr1	824236	TGCGTACATATATTTAA	T	11
chr1	836291	AGTCCCTGGCTCCATGTGG	A	11
chr1	855681	ATGGCCAATCCGTCAG	A	11
chr1	888715	A	ACTTGACGACACCGCATATCACTTCCC	11
chr1	898531	T	TATCACATGTGCTTAGCACATCGAGCG	11
chr1	908621	G	GGTTAACATTTGTGGTCTCGCCTGTGA	11
chr1	913271	A	AAGTGAACTG	11
chr1	922037	CCAGAAACCCAGG	C	11
chr1	972607	A	ATTAACCATACGATGGCT	11
chr1	974202	TGGGACTGGGGAGGGAGAGCAGGGGG	T	11
chr1	982701	A	AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG	11
chr1	996668	CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA	C	11
chr1	1043129	CTTCCTCCACCATCCCCTCCCTGGA	C	11
chr1	1044304	GCGGCCGCTCACACTGACACCACCC	G	11
chr1	1064273	AGCTT	A	11
chr1	1069991	TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC	T	11
chr1	1084687	A	AATGAG	11
chr1	1115974	GC	G	11
chr1	1118050	TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG	T	11
chr1	1124569	C	CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA	11
chr1	1135476	A	AGATCAT	11
chr1	1149097	TTTCAGGATAAGA	T	11
chr1	1157057	G	GACCCAGCAAGCTACT	11
chr1	1198844	CACGGCCAAGTCGCCGG	C	11
chr1	1212269	AAGGACGCTGCCCAGCCCCACGTGCC	A	11
chr1	1270788	CCTGCCAGCCCCTTGACA	C	11
chr1	1274202	TAGGGGAGGC	T	11
chr1	1286130	CTGTCCCTGG	C	11
chr1	1341760	A	ATCGATAC	11
chr1	1383117	A	ACGAGGGTTTTGGGTGACCGAGTCCCC	11
chr1	1390838	GGTGCGA	G	11
chr1	1399183	A	AGTTCCAAAGTACCGGTACTATTACTTGAGCCCTTATACATC	11
chr1	1421032	CACT	C	11
chr1	1440681	AGTCCTCGGCGTCACGCAATGCTCACCTCGCCTCTTCC	A	11

STROBEALIGN V0.8.0

(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_v080_cmake/SIM150.variants.INDEL.ovl/sites.txt 
chr1	52171	TGATCTAA	T	11
chr1	92584	CCAGTGGCACCCTGACTGGCTTGGT	C	11
chr1	125918	ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT	A	11
chr1	283067	TTAATATATAACTGG	T	11
chr1	286391	TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA	T	11
chr1	619526	T	TCCTGAGCTGGATCAGGGACATTCCTA	11
chr1	629866	TCAACA	T	11
chr1	760358	TCTTCTAATGTAATAAAATTTGCTTCGGC	T	11
chr1	770917	A	AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT	11
chr1	798649	TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA	T	11
chr1	819783	G	GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC	11
chr1	824236	TGCGTACATATATTTAA	T	11
chr1	836291	AGTCCCTGGCTCCATGTGG	A	11
chr1	855681	ATGGCCAATCCGTCAG	A	11
chr1	875485	G	GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA	11
chr1	879789	T	TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG	11
chr1	888715	A	ACTTGACGACACCGCATATCACTTCCC	11
chr1	898531	T	TATCACATGTGCTTAGCACATCGAGCG	11
chr1	908621	G	GGTTAACATTTGTGGTCTCGCCTGTGA	11
chr1	913271	A	AAGTGAACTG	11
chr1	922037	CCAGAAACCCAGG	C	11
chr1	972607	A	ATTAACCATACGATGGCT	11
chr1	974202	TGGGACTGGGGAGGGAGAGCAGGGGG	T	11
chr1	982701	A	AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG	11
chr1	996668	CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA	C	11
chr1	1043129	CTTCCTCCACCATCCCCTCCCTGGA	C	11
chr1	1044304	GCGGCCGCTCACACTGACACCACCC	G	11
chr1	1064273	AGCTT	A	11
chr1	1069991	TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC	T	11
chr1	1084687	A	AATGAG	11
chr1	1115974	GC	G	11
chr1	1118050	TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG	T	11
chr1	1124569	C	CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA	11
chr1	1135476	A	AGATCAT	11
chr1	1149097	TTTCAGGATAAGA	T	11
chr1	1157057	G	GACCCAGCAAGCTACT	11
chr1	1168799	G	GCGTAAGAGCTGCGGCCTGTTAAGCGCGCCCGGTGTACTAGACGCTCCCC	11
chr1	1198844	CACGGCCAAGTCGCCGG	C	11
chr1	1212269	AAGGACGCTGCCCAGCCCCACGTGCC	A	11
chr1	1270788	CCTGCCAGCCCCTTGACA	C	11
chr1	1274202	TAGGGGAGGC	T	11
chr1	1286130	CTGTCCCTGG	C	11
chr1	1341760	A	ATCGATAC	11
chr1	1383117	A	ACGAGGGTTTTGGGTGACCGAGTCCCC	11
chr1	1390838	GGTGCGA	G	11
chr1	1399183	A	AGTTCCAAAGTACCGGTACTATTACTTGAGCCCTTATACATC	11
chr1	1421032	CACT	C	11
chr1	1440681	AGTCCTCGGCGTCACGCAATGCTCACCTCGCCTCTTCC	A	11
chr1	1441544	G	GT	11
chr1	1446356	A	ATCTGCTCCGATAAAGCAGGTCGTAAGTCTTTACTCTGCCTGCCTCAAG	11

SIMULATED VARIANTS


#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	simulated
chr1	43416	sim_small_indel_0	AAT	A	.	PASS	.	.	.
chr1	43946	sim_small_indel_1	A	AAGCGAATATGCGGGCGTCAGAGTT	.	PASS	.	.	.
chr1	49627	sim_small_indel_2	G	GTTCTAGTTGGC	.	PASS	.	.	.
chr1	52171	sim_small_indel_3	TGATCTAA	T	.	PASS	.	.	.
chr1	55465	sim_small_indel_4	TTT	T	.	PASS	.	.	.
chr1	81671	sim_small_indel_5	T	TGTCACACGCTTGCTAGTCATCTGAGCCTAGAAAAGCTACCACAA	.	PASS	.	.	.
chr1	86735	sim_small_indel_6	T	TAATCTAACCCGGACACATTTATGGTGGGCCG	.	PASS	.	.	.
chr1	92584	sim_small_indel_7	CCAGTGGCACCCTGACTGGCTTGGT	C	.	PASS	.	.	.
chr1	95760	sim_small_indel_8	C	CCGTCTGCTGGTAATAAGGCTGGCTCTCGCCCCACCTCAG	.	PASS	.	.	.
chr1	97153	sim_small_indel_9	A	ATTTTTTTAATCGCACCTCTT	.	PASS	.	.	.
chr1	99579	sim_small_indel_10	CGTTTGTTAAAAGATATTATTTTGCTTTACACTTTTTCTCTCAGAAATAAG	C	.	PASS	.	.	.
chr1	112138	sim_small_indel_11	G	GACAGCTCG	.	PASS	.	.	.
chr1	120090	sim_small_indel_12	AAGAAAGCAAATTAACTTCTAACATACAAAGCCTTAGAGA	A	.	PASS	.	.	.
chr1	125918	sim_small_indel_13	ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT	A	.	PASS	.	.	.
chr1	129832	sim_small_indel_14	AGGCCAAGGT	A	.	PASS	.	.	.
chr1	140753	sim_sv_indel_0	GGAAGATTCTGTCTCAAAAAAAAGAAAAGAAATATATATTTAATCTCTGTCCCTGGTTCCTGGCACAGAGCTTCTAAAGCTCTTACAAAGACCTCAGTGATAGATGTGACAGGAGCATCTTTTGTTTTAATATTTGGTCTTGGTCCCAGGTTTCTAACACAAGAGCCTCTAAGAACTTTGGGATCTCCAGCATGGTAAGAATGCATTTGGGGATGTTGTTGAGATGACTGGGTGACTGCAAGCTCCTAAATTTCTTCAAGAGGAGGGCTGATTACCATGCAACCACATGGTAAGAGGCTTGGAACTTTCAGCCTCATGCACTGAACTCCAGGGGGAAGAGGGGCTGGAGACTGACTTAATCACCAACAGCCAAAGGTTTTATCAATCATGCTTGCATAATAAAGCCTCCATAAACACCCTGAAAGGGGTTTGCAGAGCTTTCAGGGTTGCTGGACACAGGAGATGCTGGGAGGGTCGCATGTTCAACAGAGGGCATGGGAGCTCTGTGCCCCTCCGAACTTAACTTGCCCTGGGTATCTTTCTTTTTTTTGAGACAGGATCAGGCTCTTTTGTCCAAGCTGGAGTGCAGTGGCACAATCTCAGCTTACTGTAACCTAAGCCTCCCCAGTCCCCAGCTCAAGGTATCCTCTCATCTCAGCTTCCCTAGTAGTTGGAACTCTAGGTGCACAACACCACACCAGTTATTATTATTATTTTTTAATTTTTTATAGAGACAGGTTTTCACCATGTTGCCCAGGCTGGTCTCAAACTCCTGAGTTTAAGCGATCCTCCCACCTTGGCCTCCCA	G	.	PASS	SVTYPE=DEL;SVLEN=-806	.	.
chr1	146737	sim_small_indel_15	ATTTGGATCCT	A	.	PASS	.	.	.
chr1	199395	sim_small_indel_16	TGGGGTCCCCTACGAGGGACCAGGAGCTCCGGGCGGGCAGCAGCTGCGGA	T	.	PASS	.	.	.
chr1	200080	sim_small_indel_17	CTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTC	C	.	PASS	.	.	.
chr1	270212	sim_small_indel_18	CCCTATGTCCCCAGGGCTCTCCATGTGTAGAGCTGAGACCATTTGC	C	.	PASS	.	.	.
chr1	271629	sim_small_indel_19	A	AGATGTGTGGAAAAGTGGTCAGACGCGTCTTAAGGCTCGTCGTGG	.	PASS	.	.	.
chr1	283067	sim_small_indel_20	TTAATATATAACTGG	T	.	PASS	.	.	.
chr1	286391	sim_small_indel_21	TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA	T	.	PASS	.	.	.
chr1	292694	sim_small_indel_22	G	GAACGTGGGCCCTGACAAGCCAATCGG	.	PASS	.	.	.
chr1	370401	sim_small_indel_23	TCTCCATGTCCTGCTGGTTAAG	T	.	PASS	.	.	.
chr1	384760	sim_small_indel_24	GACACACCAAAAACAAGCAAATTCCAGAGGATC	G	.	PASS	.	.	.
chr1	385852	sim_sv_indel_1	C	CCGAATTAGACACAAGATTCAGCCATCCGGAAATATGCCGATAGTTGCAATTTTTAAAAATTTATCTTC	.	PASS	SVTYPE=INS;SVLEN=68	.
chr1	393073	sim_small_indel_25	CTGGAGAG	C	.	PASS	.	.	.
chr1	407754	sim_small_indel_26	GCAAATTCTGTTGTTTGTATAAACATCAGC	G	.	PASS	.	.	.
chr1	435352	sim_small_indel_27	A	ACAGAGGAGAAGTACGTATGTGGGCGCCGTAATTGC	.	PASS	.	.	.
chr1	439005	sim_sv_indel_2	GGACATATATAACAAGCTCACAGCTCACATCATACCCAACAATGAAAAAGTGAAATCTTTTCTGCTAAGATCAAGAACAAGACAAGGATATTTATTCTCACTACTTCTATTCAACTTATTTCTGGAAGTCCTAGCCAGAGCAATTAAGCCAAATAAAGAAATAAAAGATATTCAAATTGAAAAGGAAGAAGTAAAATTGTCTCTGTTTGATGACATATTATATATAGGAAACCCTAAAAACTCCACCAAAAAGCTATTAGAAATGATAAATGAATTCAATAAAATTGCAGAATTCAAAATCAATATACAAAACTCAGTAGTTTCTTTACACTCACAACAAACTATATGACAAAAATAAAGAAATCAATCTCATTCACAGTAGCATCAAAAAAACTGTATTTTTTTTGTTTAGGAGCACATTTAGGATTGTACTTAGGAGTACATTTAACCAAGGAGGTGAAAGATCTGTATTCTGAACACTATAAAACATTGATGAAAAATTGTAGATGACACAAATACATGGAAAGATATTTTATGTTCATGGGTAGGAAGAATTAATATTCTTAAAATGTCCTTACTGCCCAAAGCGATTTATAGGTTTAATGCAACATTTATCAAAATTTCAATGTCATTCTTCACAGAAATAGAAAAAACAATTTGAAAATTTATATGGAACCACAAAGGACCCTGAATAACTAAAGCACTCTTGAGCAATAAGAACAAA	G	.	PASS	SVTYPE=DEL;SVLEN=-723	.	.
chr1	440769	sim_small_indel_28	A	ACCCATACAT	.	PASS	.	.	.
chr1	459461	sim_small_indel_29	G	GAGAACTCT	.	PASS	.	.	.
chr1	461504	sim_small_indel_30	AACTATCCACTAAAAATGAATAGAGATAACATGATACAAG	A	.	PASS	.	.	.
chr1	463600	sim_small_indel_31	T	TAGGGGAACTGACGCCTCACAGATAAACCACAATCGGTGTGACGA	.	PASS	.	.	.
chr1	495731	sim_small_indel_32	A	ATCTAGACGCCAGATCGGAGGGCAGG	.	PASS	.	.	.
chr1	515327	sim_small_indel_33	A	AGTCAGTCGAATAAGTA	.	PASS	.	.	.
chr1	530647	sim_small_indel_34	A	ATCTACA	.	PASS	.	.	.
chr1	533974	sim_small_indel_35	T	TAGACCGTCTTCAACCGGTT	.	PASS	.	.	.
chr1	535531	sim_small_indel_36	A	AGTGAA	.	PASS	.	.	.
chr1	598243	sim_sv_indel_3	G	GTTCGCCTTCAGGGGTTTGGATCGGATGGTTTATACGCGAATAGACACTACCTCTCCACCATGGCGAACACAGAAATCAATTTG	.	PASS	SVTYPE=INS;SVLEN=83	.	.
chr1	612741	sim_small_indel_37	CTGTAGTCCCAGCTACTCCGGA	C	.	PASS	.	.	.
chr1	615614	sim_small_indel_38	A	ATACATCGTAACT	.	PASS	.	.	.
chr1	619526	sim_small_indel_39	T	TCCTGAGCTGGATCAGGGACATTCCTA	.	PASS	.	.	.
chr1	625227	sim_small_indel_40	A	ACCTATCCTTCCGCATTCACTAGACAACGTACGGGAAGAAAATATGGACAA	.	PASS	.	.	.
chr1	626998	sim_small_indel_41	CTAA	C	.	PASS	.	.	.
chr1	629866	sim_small_indel_42	TCAACA	T	.	PASS	.	.	.
chr1	656338	sim_small_indel_43	A	AAGCATCGTTATTACCCCGTGACGCCCCAGGCCAGCCGATAGTGACG	.	PASS	.	.	.
chr1	667027	sim_small_indel_44	G	GGACAGTCCCTA	.	PASS	.	.	.
chr1	708142	sim_small_indel_45	C	CGCATGCCAATCCGCAGTAGCTATCCCCA	.	PASS	.	.	.
chr1	711525	sim_small_indel_46	ACAGAAGAAAATGTCTCCGGCAATAAATCAC	A	.	PASS	.	.	.
chr1	713164	sim_small_indel_47	TAGCTATTTGTAAATTGTCATGCATTGTTAATG	T	.	PASS	.	.	.
chr1	723221	sim_small_indel_48	T	TCTGGAGAGATCAATGGTGGGCGATTACTAAGTTTGA	.	PASS	.	.	.
chr1	734815	sim_small_indel_49	A	AACTTGGCTAATCCAATTCCCGCC	.	PASS	.	.	.
chr1	749841	sim_small_indel_50	G	GGAGTATCCCAACACCTAGCTCGGCATC	.	PASS	.	.	.
chr1	752489	sim_small_indel_51	A	AGACGACACCATTAGCCCAACGTGTCTACGAAACAGCGCCACGAGTCCTC	.	PASS	.	.	.
chr1	753148	sim_small_indel_52	A	AATTGGGAAACCAAT	.	PASS	.	.	.
chr1	760358	sim_small_indel_53	TCTTCTAATGTAATAAAATTTGCTTCGGC	T	.	PASS	.	.	.
chr1	770917	sim_small_indel_54	A	AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT	.	PASS	.	.	.
chr1	784407	sim_sv_indel_4	CAACAGCAACAAGCACACCTGCTACCTAGAACTTGGTTTCTAAAAACCATCCTCCTCTAAAAGGAAACAGAGCTCTTTGAAGAAACGGTAATTTCAAAGCTAGAGCACAGAAAGTATAAGATGAGCCCAGAATATCTTTTTGTACAAGAAAGTAAGGCAATGCTCAAAGAATGATGGAGACATGCCCAAAAAAAGCAGAGAAGCCAACTTGAAAAGATTCCAACAGGCCAAATATGGGGCGATTTGAACATCAGAATAAAATGGTGACAGTCACAGAATTATAACCCAGTGAAGAAAGGAATTCATGACATCATATTAATATAAATAATAATTGAATATTGAAATTCGTTAAAGGAAATGAGATATTTATGTAGTCTTAAGGTATCTTCTCACAAATTATGTATTACTTACAAAGGGGAAAAGTGCACCTTGACGTGAGAATCTTGGCAGAACTACTTTAATCAAGAGGTTTAGTTGAGCATTATCAGTAACAGAGTAAATCAAAATCATAAGCCACTTGATAGATACAATGAGAAAATAAAGCATCACTTCTGTGACACCCTATAAAAAATGAATCTAATAATGAGGAAACATGAGAAAAAACCAAACTGAGGGATATTCTACAAATAAATTGTCCTGTAATCTTCAAAAATTTCAAGGTTATAAAAGTCAAGAAAAACAGACAATCTTTTCCAAACTCAAGGGAACTAAAGAGGCATGAAATCTAAATGCTTAATTCTGGATTTCATCTTTTTGATATAAAGAACAGTATTGAGACTATTGATAAAATTTGAGTGGGGTCTGAAGATTCATTGGTAGTAGTAAGCTCAATTTAATTTCCTCACTTTGATCATTGTACTCTGGTTTTTAGAATGTTCTAGTCAACAGGTAATATGAATTAAACTAATCGAGGGTGACAGGGCATTATATCAGCCACTTATCAACAAACATTTCAGAGAAAAATAGTTCCTTGTATTGTTATTG	C	.	PASS	SVTYPE=DEL;SVLEN=-983	.	.
chr1	786095	sim_small_indel_55	C	CTGCTTGAACCTCAGCCGCCACCCCGAACAATTTACATC	.	PASS	.	.	.
chr1	798649	sim_small_indel_56	TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA	T	.	PASS	.	.	.
chr1	819783	sim_small_indel_57	G	GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC	.	PASS	.	.	.
chr1	823128	sim_small_indel_58	ACTGTCGA	A	.	PASS	.	.	.
chr1	824236	sim_small_indel_59	TGCGTACATATATTTAA	T	.	PASS	.	.	.
chr1	836291	sim_small_indel_60	AGTCCCTGGCTCCATGTGG	A	.	PASS	.	.	.
chr1	855681	sim_small_indel_61	ATGGCCAATCCGTCAG	A	.	PASS	.	.	.
chr1	859639	sim_small_indel_62	A	AAATTTCGTAGACTGACCATGACGGGACAAGCCCGGCCTCTGCGATC	.	PASS	.	.	.
chr1	869665	sim_small_indel_63	AAACAACCCTGCCATAGCAGCCCATCAGTCCTCTGAGACAGGTGAGGAA	A	.	PASS	.	.	.
chr1	875485	sim_small_indel_64	G	GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA	.	PASS	.	.	.
chr1	879789	sim_small_indel_65	T	TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG	.	PASS	.	.	.
chr1	888715	sim_small_indel_66	A	ACTTGACGACACCGCATATCACTTCCC	.	PASS	.	.	.
chr1	898531	sim_small_indel_67	T	TATCACATGTGCTTAGCACATCGAGCG	.	PASS	.	.	.
chr1	908621	sim_small_indel_68	G	GGTTAACATTTGTGGTCTCGCCTGTGA	.	PASS	.	.	.
chr1	913271	sim_small_indel_69	A	AAGTGAACTG	.	PASS	.	.	.
chr1	922037	sim_small_indel_70	CCAGAAACCCAGG	C	.	PASS	.	.	.
chr1	930498	sim_small_indel_71	CCTCATCTGCCCCCTGTCTGGCGTCCCCTCCCACCT	C	.	PASS	.	.	.
chr1	938349	sim_small_indel_72	G	GGGAAGCGGCGGTAGCGTGAT	.	PASS	.	.	.
chr1	972607	sim_small_indel_73	A	ATTAACCATACGATGGCT	.	PASS	.	.	.
chr1	974202	sim_small_indel_74	TGGGACTGGGGAGGGAGAGCAGGGGG	T	.	PASS	.	.	.
chr1	977470	sim_sv_indel_5	CCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCC	C	.	PASS	SVTYPE=DEL;SVLEN=-129	.	.
chr1	982067	sim_small_indel_75	GGGTCCCGGCGGCCGAGCTGGGCGTCTGAGCTGCCCGTCCTGGGTCGG	G	.	PASS	.	.	.
chr1	982701	sim_small_indel_76	A	AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG	.	PASS	.	.	.
chr1	986429	sim_sv_indel_6	AGATACCTGGGGTTGACCGTGTGGACACATGCTTTCATTTCTCTTGGGTACATACTCGGAATTGGAGTGGCTAGATCATATGATAGGTATATGTTTAACTTTTTGAGAAACTGCCAGACCGTCTTCCAGTGTGGTTGTACAATTATTTATTCCCACTAGAGGTTCCAGGCTCCGTGTCCTCGTCCTCACCCACACTCAGCGCGGGACAGCTTTGTAATTCAGTCATCCCAGTAGGCGCTCATTCTGGTTTTAATTTGCATTTCCTTAATGATGAACAACACTGGGCATCTTCTCACGTGTGCATTTGTTATCCATGTATCTTCTTTGGTAAAAAGACTGATCAAATATTTTGTCCATTTTATTTTATTTGAGACAGAGCCTCACTCTGTTGCCCAGGCTGGAATGAAGCAGTGAGATCTTGGCTCACTGCAACCTCCACCCCCTGGGTTCAAGTGATTGTTCCACCTCAGCTTCCCGAGTAGCTGGGATTACAGGTGCTGACACCACACCTGGCT	A	.	PASS	SVTYPE=DEL;SVLEN=-514.
chr1	988195	sim_small_indel_77	CTATGATCTATTTTTAGTTAATTTG	C	.	PASS	.	.	.
chr1	992088	sim_small_indel_78	TAAACGTATGTTAGGCC	T	.	PASS	.	.	.
chr1	996668	sim_small_indel_79	CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA	C	.	PASS	.	.	.
chr1	1007989	sim_small_indel_80	C	CCGTGC	.	PASS	.	.	.


marcelm commented Mar 6, 2023

Since I’m bad at reading raw CSVs, here are the result tables formatted in a nicer way.

SNVs

dataset tool recall precision f_score
BIO150 strobealign_v071 97.9 85.9 91.5
BIO150 strobealign_v080 97.9 85.9 91.5
BIO150 strobealign_main 97.8 85.9 91.4
BIO250 strobealign_v071 95.8 82.0 88.4
BIO250 strobealign_v080 95.9 81.9 88.4
BIO250 strobealign_main 95.8 81.8 88.3
SIM150 strobealign_v071 97.4 99.5 98.4
SIM150 strobealign_v080 97.4 99.5 98.4
SIM150 strobealign_main 97.3 99.5 98.4
SIM150 bwa_mem 97.9 99.6 98.7
SIM250 strobealign_v071 98.5 99.6 99.0
SIM250 strobealign_v080 98.5 99.6 99.0
SIM250 strobealign_main 98.5 99.5 99.0
SIM250 bwa_mem 98.9 99.5 99.2

Indels

dataset tool recall precision f_score
BIO150 strobealign_v071 8.0 5.5 6.6
BIO150 strobealign_v080 8.0 5.5 6.5
BIO150 strobealign_main 7.8 4.7 5.9
BIO250 strobealign_v071 7.8 6.2 6.9
BIO250 strobealign_v080 7.8 6.2 6.9
BIO250 strobealign_main 7.6 5.4 6.3
SIM150 strobealign_v071 49.6 53.6 51.5
SIM150 strobealign_v080 49.6 53.6 51.5
SIM150 strobealign_main 64.3 53.6 58.5
SIM150 bwa_mem 45.9 53.6 49.5
SIM250 strobealign_v071 50.5 53.9 52.1
SIM250 strobealign_v080 50.5 53.9 52.1
SIM250 strobealign_main 67.5 59.9 63.5
SIM250 bwa_mem 50.4 53.7 52.0
  • It’s weird that BWA-MEM has the same precision and recall as strobealign to within 0.1% accuracy until v0.8, but then suddenly strobealign’s numbers jump.
  • Why do we not see an improvement in the BIO datasets? Would you say it is because the ground truth is incomplete?

I need to reproduce your results on a small portion of the data (I’ll probably use a single chromosome and SIM150). First I’ll test whether /1 and /2 make a difference, but I doubt it (if it mattered, I would not expect the results to be so similar to BWA-MEM for <=0.8).
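
A minimal sketch of that test, stripping the /1 and /2 suffixes from the read names of an existing BAM (file names are placeholders):

samtools view -h with_suffix.bam \
  | awk 'BEGIN { FS = OFS = "\t" } !/^@/ { sub(/\/[12]$/, "", $1) } { print }' \
  | samtools view -b -o without_suffix.bam -
samtools index without_suffix.bam   # order is unchanged, so the BAM stays coordinate-sorted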


ksahlin commented Mar 6, 2023

Why do we not see an improvement in the BIO datasets? Would you say it is because the ground truth is incomplete?

I don't know for sure, but I can guess. What I can say is that I only filtered the GIAB truth call set into SNPs and indels using zgrep -P "\t[ACGT]\t[ACGT]\t" for SNPs and zgrep -v -P "\t[ACGT]\t[ACGT]\t" for indels. The 'indels' therefore include larger SVs as well (basically anything but SNVs, I would say). If a tiny indel prediction overlaps a large rearrangement or similar, it is still counted as an overlap, hence a TP. This is something that could be done better. In the simulations, I split the variants according to a key field in the VCF that says 'INDEL', so it should only be what mason classifies as smaller indels and not larger SVs.

Edit: I realized that this response does not answer your question :) It only states that there is more uncertainty in the TP and FP estimates for the biological datasets. I would assume that the TPs, and hence recall, would go up for BIO as well, but they drop. Btw, I am not sure I follow the boldface system in your table.


marcelm commented Mar 6, 2023

Just commenting on this until I know more:

Btw, I am not sure I follow your boldfaced system in the table.

Sorry, forgot to write a sentence about that: The bold numbers are those that are somehow "interesting" in that they are different compared to the version preceding it.


marcelm commented Mar 6, 2023

One difference between v0.8.0 and main appears to be that calling indels with main produces many duplicate indels like these:

chrEBV  6422    .       CAACACGGGTTCCCAGAGAGGGTAAAAGAGGGGGCCATAA        CAA
chrEBV  6424    .       ACACGGGTTCCCAGAGAGGGTAAAAGAGGGGGCCATAA  A

Here’s how it looks in IGV:
alignment-ebv

So v0.8.0 left-aligns indels, but main doesn’t.

On chrEBV, 15 indels are called from the v0.8.0 BAM, but 20 from the main BAM. All five extra indels are duplicates like the above.

(I keep getting confused by the "main" name because it isn’t main but actually the wfa branch.)

However, if there for some reason are many redundant (overlapping) indel calls overlapping with a true indel, I think a TP will be counted one time for each overlapping indel call with isec. Maybe main is having problem with this?

Assuming the above observation holds not only on chrEBV, I think that is the issue!
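
For reference, a rough sketch of how such a per-chromosome indel count could be obtained (not necessarily the exact commands used here; ref.fa, v080.bam, and main.bam are placeholders):

for bam in v080.bam main.bam; do
    bcftools mpileup -f ref.fa -r chrEBV $bam 2> /dev/null \
      | bcftools call -mv -V snps 2> /dev/null \
      | grep -vc '^#'     # number of indel records called on chrEBV
done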


marcelm commented Mar 6, 2023

With #251, where I added a fix to ensure that indels are left-aligned, the alignments for the above example look as they did in v0.8.0.


ksahlin commented Mar 6, 2023

Great that you found the(/a) cause of the drastic difference! I am, however, a bit surprised. Among the first 50 indel calls I pasted above for bwa mem, main, and v0.8.0, there were a lot of calls uniquely detected by main that did not seem to overlap other calls and were also correct calls (8/50, which is roughly 15% - the increase in recall). Was this just a very unlikely coincidence? Did you have a look at them?


ksahlin commented Mar 6, 2023

Let me know if you want me to rerun a new commit as 'main' on our SIM and BIO datasets to check if it goes back to previous performance.


marcelm commented Mar 8, 2023

I manually looked at the first variant (chr1:535531 A→AGTGAA) that you marked as "not found by BWA". The alignments covering that variant look fine for all programs (visual inspection in IGV), and the variant is called for all tools. What is different is the way the variant is represented.

In the simulation VCF, the variant is written like this:

chr1    535531  sim_small_indel_36  A   AGTGAA ...

In the BWA-MEM, strobealign v0.7.1, and strobealign v0.8.0 result VCFs, the variant looks like this:

chr1    535529  .   AAA AAAGTGAA ...

In the strobealign "main" (wfa) results file, it is represented like this:

chr1    535531  .   A   AGTGAA

It’s the same variant in all cases.

I think this can be taken care of by running the VCFs through bcftools norm before comparing them. Then recall and precision for all programs will go up.
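For reference, a minimal sketch of such a normalization step (file names are placeholders; the reference must be the same FASTA the reads were aligned to). With a reference, bcftools norm left-aligns and normalizes indels, so equivalent representations like the ones above become identical records:

bcftools norm -f GRCh38.fa calls.vcf.gz -Oz -o calls.norm.vcf.gz
tabix -p vcf calls.norm.vcf.gz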

@ksahlin
Copy link
Owner Author

ksahlin commented Mar 8, 2023

Excellent that you tracked down the cause. It looked too good to be true :) I will integrate bcftools norm as soon as I have time and rerun the evaluation. Maybe that will give more stable feedback on the BIO results as well.

@ksahlin
Copy link
Owner Author

ksahlin commented Mar 28, 2023

I integrated bcftools norm -c s (which sets/fixes REF alleles that do not match the reference) into the evaluation pipeline and ran bcftools norm on the calls for all aligners; precision/recall values for INDEL calls are below (columns: dataset, tool, variant type, recall, precision, F-score). The 'main' that was evaluated here has much lower precision (because of the duplicate indel calls), so I consider that mystery solved/explained. However, it also has lower recall on BIO data.

Since a lot has happened, I suggest simply rerunning the benchmark for v0.7.1, v0.8.0, v0.9.0, and commit 8311650.

BIO150,strobealign_v071,INDEL,89.1,63.8,74.3
BIO150,strobealign_v080_cmake,INDEL,89.3,63.5,74.2
BIO150,strobealign_main,INDEL,83.9,50.8,63.3

BIO250,strobealign_v071,INDEL,82.7,68.0,74.6
BIO250,strobealign_v080_cmake,INDEL,82.7,67.5,74.3
BIO250,strobealign_main,INDEL,76.9,54.5,63.8

SIM150,strobealign_v071,INDEL,67.2,72.7,69.8
SIM150,strobealign_v080_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_main,INDEL,67.2,56.0,61.0

SIM250,strobealign_v071,INDEL,68.4,73.0,70.6
SIM250,strobealign_v080_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_main,INDEL,68.4,60.7,64.4

@ksahlin
Copy link
Owner Author

ksahlin commented Mar 28, 2023

I have the results for the benchmark of v0.7.1, v0.8.0, v0.9.0, and commit 8311650 (called strobealign_wfa2). Perhaps we should close this one and start a new issue? Anyway, below I have pasted the new results.

Some significant remarks:

  • strobealign_wfa2 has very low INDEL recall and precision. The values are so far off that the cause is probably something significant, such as indels not being left-aligned or similar.
  • strobealign_v0.9.0 has a fairly significant decrease in SNV-calling F-score compared to v0.7.1 and v0.8.0 across all datasets. Specifically, v0.9.0 has higher recall but still loses in F-score because of significantly lower precision. (If we have verified that accuracy is better, perhaps this is a MAPQ value issue?)
  • INDEL calling looks relatively stable between v0.7.1, v0.8.0, and v0.9.0.

(runtime is pasted at bottom)

Edit: I also checked BWA-MEM's results; it typically has higher recall than strobealign and much worse precision (see the added lines below), so maybe strobealign_v0.9.0 is 'moving in the right direction' (assuming bwa mem does a good job)?

dataset,tool,variant,recall,precision,f_score
#BIO150
BIO150,bwa_mem,SNV,98.9,80.5,88.8
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v090_cmake,SNV,98.3,83.9,90.5
BIO150,strobealign_wfa2,SNV,98.2,83.7,90.3

BIO150,bwa_mem,INDEL,90.1,62.5,73.8
BIO150,strobealign_v071,INDEL,89.1,63.8,74.3
BIO150,strobealign_v080_cmake,INDEL,89.3,63.5,74.2
BIO150,strobealign_v090_cmake,INDEL,89.4,63.7,74.4
BIO150,strobealign_wfa2,INDEL,84.4,48.8,61.8

#BIO250
BIO250,bwa_mem,SNV,97.7,76.7,85.9
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v090_cmake,SNV,96.4,80.7,87.8
BIO250,strobealign_wfa2,SNV,96.3,80.5,87.7

BIO250,bwa_mem,INDEL,84.1,67.3,74.7
BIO250,strobealign_v071,INDEL,82.7,68.0,74.6
BIO250,strobealign_v080_cmake,INDEL,82.7,67.5,74.3
BIO250,strobealign_v090_cmake,INDEL,82.7,67.5,74.4
BIO250,strobealign_wfa2,INDEL,78.0,52.4,62.7

#SIM150
SIM150,bwa_mem,SNV,97.9,99.6,98.7
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v090_cmake,SNV,97.5,97.7,97.6
SIM150,strobealign_wfa2,SNV,97.6,97.5,97.5

SIM150,bwa_mem,INDEL,62.2,72.6,67.0
SIM150,strobealign_v071,INDEL,67.2,72.7,69.8
SIM150,strobealign_v080_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_v090_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_wfa2,INDEL,67.3,60.5,63.8

#SIM250
SIM250,bwa_mem,SNV,98.9,99.5,99.2
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v090_cmake,SNV,98.6,98.1,98.4
SIM250,strobealign_wfa2,SNV,98.6,98.0,98.3

SIM250,bwa_mem,INDEL,68.4,72.8,70.5
SIM250,strobealign_v071,INDEL,68.4,73.0,70.6
SIM250,strobealign_v080_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_v090_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_wfa2,INDEL,68.5,59.4,63.6 

Points on runtime

  • v0.9.0 is the fastest - a really nice speedup over v0.7.1 (although the v0.7.1 runtimes reported in the paper are slightly faster, probably because this v0.7.1 is the conda installation).
  • the wfa2+ksw2 extension is slower on BIO datasets but faster on SIM.
tool,dataset,read_length,time,memory

strobealign_v071,BIO150,150,4519.341,32.461476
strobealign_v080_cmake,BIO150,150,4057.6,22.373084
strobealign_v090_cmake,BIO150,150,3939.74,22.373228
strobealign_wfa2,BIO150,150,4320.05,22.372988

strobealign_v071,BIO250,250,2307.649,33.469112
strobealign_v080_cmake,BIO250,250,2069.56,22.302748
strobealign_v090_cmake,BIO250,250,2064.28,22.302692
strobealign_wfa2,BIO250,250,2126.21,22.302788


strobealign_v071,SIM150,150,3090.409,32.214192
strobealign_v080_cmake,SIM150,150,2774.64,22.373096
strobealign_v090_cmake,SIM150,150,2482.69,22.373132
strobealign_wfa2,SIM150,150,2465.12,22.37318

strobealign_v071,SIM250,250,2503.663,33.415296
strobealign_v080_cmake,SIM250,250,2014.43,22.302704
strobealign_v090_cmake,SIM250,250,1968.02,22.30278
strobealign_wfa2,SIM250,250,1756.85,22.302736

@marcelm
Copy link
Collaborator

marcelm commented Mar 29, 2023

Re-posting the accuracy table so it’s possibly easier to digest:

dataset variant tool recall precision f_score
BIO150 SNV bwa_mem 98.9 80.5 88.8
BIO150 SNV strobealign_v071 97.9 85.9 91.5
BIO150 SNV strobealign_v080_cmake 97.9 85.9 91.5
BIO150 SNV strobealign_v090_cmake 98.3 83.9 90.5
BIO150 SNV strobealign_wfa2 98.2 83.7 90.3
BIO150 INDEL bwa_mem 90.1 62.5 73.8
BIO150 INDEL strobealign_v071 89.1 63.8 74.3
BIO150 INDEL strobealign_v080_cmake 89.3 63.5 74.2
BIO150 INDEL strobealign_v090_cmake 89.4 63.7 74.4
BIO150 INDEL strobealign_wfa2 84.4 48.8 61.8
BIO250 SNV bwa_mem 97.7 76.7 85.9
BIO250 SNV strobealign_v071 95.8 82.0 88.4
BIO250 SNV strobealign_v080_cmake 95.9 81.9 88.4
BIO250 SNV strobealign_v090_cmake 96.4 80.7 87.8
BIO250 SNV strobealign_wfa2 96.3 80.5 87.7
BIO250 INDEL bwa_mem 84.1 67.3 74.7
BIO250 INDEL strobealign_v071 82.7 68.0 74.6
BIO250 INDEL strobealign_v080_cmake 82.7 67.5 74.3
BIO250 INDEL strobealign_v090_cmake 82.7 67.5 74.4
BIO250 INDEL strobealign_wfa2 78.0 52.4 62.7
SIM150 SNV bwa_mem 97.9 99.6 98.7
SIM150 SNV strobealign_v071 97.4 99.5 98.4
SIM150 SNV strobealign_v080_cmake 97.4 99.5 98.4
SIM150 SNV strobealign_v090_cmake 97.5 97.7 97.6
SIM150 SNV strobealign_wfa2 97.6 97.5 97.5
SIM150 INDEL bwa_mem 62.2 72.6 67.0
SIM150 INDEL strobealign_v071 67.2 72.7 69.8
SIM150 INDEL strobealign_v080_cmake 67.2 72.7 69.8
SIM150 INDEL strobealign_v090_cmake 67.2 72.7 69.8
SIM150 INDEL strobealign_wfa2 67.3 60.5 63.8
SIM250 SNV bwa_mem 98.9 99.5 99.2
SIM250 SNV strobealign_v071 98.5 99.6 99.0
SIM250 SNV strobealign_v080_cmake 98.5 99.6 99.0
SIM250 SNV strobealign_v090_cmake 98.6 98.1 98.4
SIM250 SNV strobealign_wfa2 98.6 98.0 98.3
SIM250 INDEL bwa_mem 68.4 72.8 70.5
SIM250 INDEL strobealign_v071 68.4 73.0 70.6
SIM250 INDEL strobealign_v080_cmake 68.4 73.0 70.6
SIM250 INDEL strobealign_v090_cmake 68.4 73.0 70.6
SIM250 INDEL strobealign_wfa2 68.5 59.4 63.6

@ksahlin
Copy link
Owner Author

ksahlin commented Apr 23, 2023

I evaluated commit b5305a0, which is essentially v0.9.0 with M CIGAR operations instead of =/X. We implemented this after discovering that bcftools mpileup behaves differently for M vs =/X, as posted here.
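As a toy illustration of the difference (hypothetical 100 bp read, not taken from the data): the extended operators spell out matches and mismatches separately, while M merges both into one operator; the underlying alignment is identical.

# single mismatch at position 51 of a 100 bp alignment
extended CIGAR (=/X):  50=1X49=
classic CIGAR (M):     100M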

Results for b5305a0 are given in the table below.

With M cigar strings:

  • Precision of SNV calling is much higher on the SIM datasets compared to v0.9.0. For SIM150, precision went up to 99.5, which is similar to v0.7.1 and v0.8.0. For SIM250, precision went up to 99.4, which is also similar to v0.7.1 and v0.8.0.
  • Recall on the BIO250 dataset is much higher, making up for the decrease in precision. Recall on BIO150, however, stayed the same.
  • INDEL calling on all datasets is the same as for v0.9.0.

So it turns out the cigars were responsible for most of the decrease in precision. Overall I think we can close this issue.

dataset tool variant recall precision f_score
BIO150 strobealign_b5305a0 SNV 98.3 83.7 90.4
BIO150 strobealign_b5305a0 INDEL 89.4 63.7 74.4
BIO250 strobealign_b5305a0 SNV 97.3 78.6 87.0
BIO250 strobealign_b5305a0 INDEL 82.7 67.5 74.4
SIM150 strobealign_b5305a0 SNV 97.6 99.5 98.5
SIM150 strobealign_b5305a0 INDEL 67.2 72.7 69.8
SIM250 strobealign_b5305a0 SNV 98.6 99.4 99.0
SIM250 strobealign_b5305a0 INDEL 68.4 73.0 70.6

@marcelm
Copy link
Collaborator

marcelm commented Apr 24, 2023

Ok, then let’s close this. This issue has become a potpourri of various not fully related things anyway. If anything reappears, it’s probably best to open another issue.

@marcelm marcelm closed this as completed Apr 24, 2023