Too many reads filtered out? #77

Open
melop opened this issue Jul 5, 2020 · 3 comments

Comments


melop commented Jul 5, 2020

Hello, I previously used a set of mate-pair libraries to scaffold ALLPATHS-LG scaffolds, and it worked well. I now have a new assembly of nanopore contigs and am trying to apply the same scaffolding procedure.
However, it looks like most of the reads were discarded. What could be the reason for this?

Thanks
Ray

Statistics.txt

Initial number of contigs: 48716.
Number of contigs discarded from further analysis (with -filter_contigs set to 10): 1
Time elapsed for reading in contig sequences:7.42810487747

PASS 1

-T 7107.0 -t 5673.0
Contamine mean before filtering : 3169.82170245
Contamine stddev before filtering: 22984.3175557
Contamine mean converged: 677.939688523
Contamine std_est converged: 1121.3850919

LIBRARY STATISTICS
Mean of library set to: 2805.0
Standard deviation of library set to: 717.0
MP library PE contamination:
Contamine rate (rev comp oriented) estimated to: False
lib contamine mean (avg fragmentation size): 0
lib contamine stddev: 0
Number of contamined reads used for this calculation: 10081.0
-T (library insert size threshold) set to: 7107.0
-k set to (Scaffolding with contigs larger than): 5673.0
Number of links required to create an edge: None
Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200
Read length set to: 62.41

Time elapsed for getting libmetrics, iteration 0: 2.3990881443

Parsing BAM file...
L50: 2662 N50: 126232 Initial contig assembly length: 1505017490
Time initializing BESST objects: 0.231798887253
Total time elapsed for initializing Graph: 0.617565870285
Reading bam file and creating scaffold graph...
ELAPSED reading file: 6581.09070301
NR OF FISHY READ LINKS: 139654
Number of USEFUL READS (reads mapping to different contigs uniquly): 338778484
Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding): 478966897
Reads with too large insert size from "USEFUL READS" (filtered out): 304923304
Initial number of edges in G (the graph with large contigs): 858809
Initial number of edges in G_prime (the full graph of all contigs before removal of repats): 2204910
Number of duplicated reads indicated and removed: 26299652
Mean coverage before filtering out extreme observations = 150.34412238
Std dev of coverage before filtering out extreme observations= 888.146107052
Mean coverage after filtering = 0.0386320692138
Std coverage after filtering = 0.0212291479729
Length of longest contig in calc of coverage: 1578503
Length of shortest contig in calc of coverage: 5673
Detecting repeats..
Removed a total of: 43707 repeats. With coverage larger than 0.13706564204
Number of edges in G (after repeat removal): 1503
Number of edges in G_prime (after repeat removal): 5008
Number of BWA buggy edges removed: 0
Number of edges in G (after filtering for buggy flag stats reporting): 1503
Number of edges in G_prime (after filtering for buggy flag stats reporting): 5008
Letting filtering threshold in high complexity regions be 5 for this library.
Letting -e be 5 for this library.
Removed 0 edges from graph G of border contigs.
Remove edges in high complexity areas.
Removed total of 0 edges in high density areas.
Removed an additional of 0 edges with low support from full graph G_prime of all contigs.
Number of significantly spurious edges: 0
Number of edges in G_prime (after removing edges under -e threshold (if not specified, default is -e 3): 5008


Nr of contigs/scaffolds included in this pass: 5008
Out of which 1503 acts as border contigs.
Total time for CreateGraph-module, iteration 0: 6599.10073209

0 link edges created.
Perform inference on scaffold graph...
Remove isolated nodes.
1503 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core
iterating until maximum of 0 extensions.
Number of nodes:10016, Number of edges: 5008
Elapsed time single core pathfinder: 0.0146651268005
0 paths detected are with score greater or equal to 1.5
Nr of contigs left: 5008.0 Nr of linking edges left: 0.0
Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0
Time elapsed for making scaffolds, iteration 0: 0.152600049973

(super)Contigs after scaffolding: 5008

param value
detect_haplotype False
hit_path_threshold False
lognormal False
orientation rf
gap_estimations []
hapl_threshold 3
gff_file None
lower_cov_cutoff 0
path_gaps_estimated 0
expected_links_over_mean_plus_stddev 5
read_len 62.41
pass_number 1
path_threshold 100000
std_dev_coverage 0.0212291479729
mean_coverage 0.0386320692138
detect_duplicate True
FASTER_ILP False
development False
std_dev_ins_size 717.0
NO_ILP False
current_N50 126232
print_scores False
mean_ins_size 2805.0
multiprocess False
scaffold_indexer 48716
hapl_ratio 1.3
no_score True
first_lib True
current_L50 2662
plots False
contigfile None
cov_cutoff None
contamination_ratio False
ins_size_threshold 7107.0
edgesupport 5
extend_paths True
tot_assembly_length 1505017490
max_extensions None
score_cutoff 1.5
min_mapq 20
information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300>
contamination_mean 0
max_contig_overlap 200
contig_threshold 6958
contamination_stddev 0
dfs_traversal True

PASS 2

-T 8421.0 -t 7243.0
Contamine mean before filtering : 644.91884719
Contamine stddev before filtering: 7618.78938119
Contamine mean converged: 323.448900031
Contamine std_est converged: 136.579587956

LIBRARY STATISTICS
Mean of library set to: 4887.0
Standard deviation of library set to: 589.0
MP library PE contamination:
Contamine rate (rev comp oriented) estimated to: 0.220716849845
lib contamine mean (avg fragmentation size): 323.448900031
lib contamine stddev: 136.579587956
Number of contamined reads used for this calculation: 97730.0
-T (library insert size threshold) set to: 8421.0
-k set to (Scaffolding with contigs larger than): 7243.0
Number of links required to create an edge: None
Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200
Read length set to: 48.78

Time elapsed for getting libmetrics, iteration 1: 2.95900011063

Parsing BAM file...
L50: 0 N50: 0 Initial contig assembly length: 1505017490
Nr of contigs/scaffolds that was singeled out due to length constraints 368
Time cleaning BESST objects for next library: 0.00483298301697
Total time elapsed for initializing Graph: 0.0218350887299
Reading bam file and creating scaffold graph...
ELAPSED reading file: 325.734697104
NR OF FISHY READ LINKS: 0
Number of USEFUL READS (reads mapping to different contigs uniquly): 0
Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding): 0
Reads with too large insert size from "USEFUL READS" (filtered out): 0
Initial number of edges in G (the graph with large contigs): 0
Initial number of edges in G_prime (the full graph of all contigs before removal of repats): 5008
Number of duplicated reads indicated and removed: 0
Mean coverage before filtering out extreme observations = 0.00857007310694
Std dev of coverage before filtering out extreme observations= 0.0171089338111
Mean coverage after filtering = 9.06876577268e-05
Std coverage after filtering = 0.000461890485611
Length of longest contig in calc of coverage: 89136
Length of shortest contig in calc of coverage: 7243
Number of edges in G (after repeat removal): 0
Number of edges in G_prime (after repeat removal): 5008
Number of BWA buggy edges removed: 0
Number of edges in G (after filtering for buggy flag stats reporting): 0
Number of edges in G_prime (after filtering for buggy flag stats reporting): 5008
Letting filtering threshold in high complexity regions be 5 for this library.
Letting -e be 5 for this library.
Removed 0 edges from graph G of border contigs.
Remove edges in high complexity areas.
Removed total of 0 edges in high density areas.
Removed an additional of 0 edges with low support from full graph G_prime of all contigs.
Number of edges in G_prime (after removing edges under -e threshold (if not specified, default is -e 3): 5008


Nr of contigs/scaffolds included in this pass: 5008
Out of which 1135 acts as border contigs.
Total time for CreateGraph-module, iteration 1: 325.839869976

0 link edges created.
Perform inference on scaffold graph...
Remove isolated nodes.
0 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core
iterating until maximum of 0 extensions.
Number of nodes:10016, Number of edges: 5008
Elapsed time single core pathfinder: 0.0115258693695
0 paths detected are with score greater or equal to 1.5
Nr of contigs left: 5008.0 Nr of linking edges left: 0.0
Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0
Time elapsed for making scaffolds, iteration 1: 0.149516105652

(super)Contigs after scaffolding: 5008

param value
detect_haplotype False
hit_path_threshold False
lognormal False
orientation rf
gap_estimations []
hapl_threshold 3
gff_file None
lower_cov_cutoff 0
path_gaps_estimated 0
expected_links_over_mean_plus_stddev 5
read_len 48.78
pass_number 2
path_threshold 100000
std_dev_coverage 0.000461890485611
mean_coverage 9.06876577268e-05
detect_duplicate True
FASTER_ILP False
development False
std_dev_ins_size 589.0
NO_ILP False
current_N50 0
print_scores False
mean_ins_size 4887.0
multiprocess False
scaffold_indexer 48716
hapl_ratio 1.3
no_score True
first_lib False
current_L50 0
plots False
contigfile None
cov_cutoff None
contamination_ratio 0.220716849845
ins_size_threshold 8421.0
edgesupport 5
extend_paths True
tot_assembly_length 1505017490
max_extensions None
score_cutoff 1.5
min_mapq 20
information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300>
contamination_mean 323.448900031
max_contig_overlap 200
contig_threshold 6958
contamination_stddev 136.579587956
dfs_traversal True

PASS 3

-T 13492.0 -t 11160.0
Contamine mean before filtering : 24633.7478754
Contamine stddev before filtering: 89658.0332439
Contamine mean converged: 6422.97819315
Contamine std_est converged: 4321.78894197

LIBRARY STATISTICS
Mean of library set to: 6496.0
Standard deviation of library set to: 1166.0
MP library PE contamination:
Contamine rate (rev comp oriented) estimated to: False
lib contamine mean (avg fragmentation size): 0
lib contamine stddev: 0
Number of contamined reads used for this calculation: 321.0
-T (library insert size threshold) set to: 13492.0
-k set to (Scaffolding with contigs larger than): 11160.0
Number of links required to create an edge: None
Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200
Read length set to: 168.11

Time elapsed for getting libmetrics, iteration 2: 3.24759888649

Parsing BAM file...
L50: 0 N50: 0 Initial contig assembly length: 1505017490
Nr of contigs/scaffolds that was singeled out due to length constraints 486
Time cleaning BESST objects for next library: 0.00424909591675
Total time elapsed for initializing Graph: 0.0218479633331
Reading bam file and creating scaffold graph...
ELAPSED reading file: 25.9454369545
NR OF FISHY READ LINKS: 0
Number of USEFUL READS (reads mapping to different contigs uniquly): 0
Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding): 0
Reads with too large insert size from "USEFUL READS" (filtered out): 0
Initial number of edges in G (the graph with large contigs): 0
Initial number of edges in G_prime (the full graph of all contigs before removal of repats): 5008
Number of duplicated reads indicated and removed: 0
Mean coverage before filtering out extreme observations = 0.00124294772942
Std dev of coverage before filtering out extreme observations= 0.00507329288087
Mean coverage after filtering = 0.00124294772942
Std coverage after filtering = 0.00507329288087
Length of longest contig in calc of coverage: 89136
Length of shortest contig in calc of coverage: 11160
Number of edges in G (after repeat removal): 0
Number of edges in G_prime (after repeat removal): 5008
Number of BWA buggy edges removed: 0
Number of edges in G (after filtering for buggy flag stats reporting): 0
Number of edges in G_prime (after filtering for buggy flag stats reporting): 5008
Letting filtering threshold in high complexity regions be 5 for this library.
Letting -e be 5 for this library.
Removed 0 edges from graph G of border contigs.
Remove edges in high complexity areas.
Removed total of 0 edges in high density areas.
Removed an additional of 0 edges with low support from full graph G_prime of all contigs.
Number of edges in G_prime (after removing edges under -e threshold (if not specified, default is -e 3): 5008


Nr of contigs/scaffolds included in this pass: 5008
Out of which 649 acts as border contigs.
Total time for CreateGraph-module, iteration 2: 26.0430119038

0 link edges created.
Perform inference on scaffold graph...
Remove isolated nodes.
0 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core
iterating until maximum of 0 extensions.
Number of nodes:10016, Number of edges: 5008
Elapsed time single core pathfinder: 0.0116968154907
0 paths detected are with score greater or equal to 1.5
Nr of contigs left: 5008.0 Nr of linking edges left: 0.0
Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0
Time elapsed for making scaffolds, iteration 2: 0.221040964127

(super)Contigs after scaffolding: 5008

param value
detect_haplotype False
hit_path_threshold False
lognormal False
orientation rf
gap_estimations []
hapl_threshold 3
gff_file None
lower_cov_cutoff 0
path_gaps_estimated 0
expected_links_over_mean_plus_stddev 5
read_len 168.11
pass_number 3
path_threshold 100000
std_dev_coverage 0.00507329288087
mean_coverage 0.00124294772942
detect_duplicate True
FASTER_ILP False
development False
std_dev_ins_size 1166.0
NO_ILP False
current_N50 0
print_scores False
mean_ins_size 6496.0
multiprocess False
scaffold_indexer 48716
hapl_ratio 1.3
no_score True
first_lib False
current_L50 0
plots False
contigfile None
cov_cutoff None
contamination_ratio False
ins_size_threshold 13492.0
edgesupport 5
extend_paths True
tot_assembly_length 1505017490
max_extensions None
score_cutoff 1.5
min_mapq 20
information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300>
contamination_mean 0
max_contig_overlap 200
contig_threshold 6958
contamination_stddev 0
dfs_traversal True

PASS 4

-T 42537.0 -t 32239.0
Contamine mean before filtering : 29519.8535565
Contamine stddev before filtering: 68616.3455717
Contamine mean converged: 16048.5931953
Contamine std_est converged: 7824.95943874

LIBRARY STATISTICS
Mean of library set to: 11643.0
Standard deviation of library set to: 5149.0
MP library PE contamination:
Contamine rate (rev comp oriented) estimated to: False
lib contamine mean (avg fragmentation size): 0
lib contamine stddev: 0
Number of contamined reads used for this calculation: 676.0
-T (library insert size threshold) set to: 42537.0
-k set to (Scaffolding with contigs larger than): 32239.0
Number of links required to create an edge: None
Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200
Read length set to: 152.81

Time elapsed for getting libmetrics, iteration 3: 3.09710383415

Parsing BAM file...
L50: 0 N50: 0 Initial contig assembly length: 1505017490
Nr of contigs/scaffolds that was singeled out due to length constraints 589
Time cleaning BESST objects for next library: 0.00451493263245
Total time elapsed for initializing Graph: 0.0214760303497
Reading bam file and creating scaffold graph...
ELAPSED reading file: 16.8508169651
NR OF FISHY READ LINKS: 0
Number of USEFUL READS (reads mapping to different contigs uniquly): 0
Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding): 0
Reads with too large insert size from "USEFUL READS" (filtered out): 0
Initial number of edges in G (the graph with large contigs): 0
Initial number of edges in G_prime (the full graph of all contigs before removal of repats): 5008
Number of duplicated reads indicated and removed: 0
Mean coverage before filtering out extreme observations = 0.00143064790959
Std dev of coverage before filtering out extreme observations= 0.00381765871116
Mean coverage after filtering = 0.00143064790959
Std coverage after filtering = 0.00381765871116
Length of longest contig in calc of coverage: 89136
Length of shortest contig in calc of coverage: 32418
Number of edges in G (after repeat removal): 0
Number of edges in G_prime (after repeat removal): 5008
Number of BWA buggy edges removed: 0
Number of edges in G (after filtering for buggy flag stats reporting): 0
Number of edges in G_prime (after filtering for buggy flag stats reporting): 5008
Letting filtering threshold in high complexity regions be 5 for this library.
Letting -e be 5 for this library.
Removed 0 edges from graph G of border contigs.
Remove edges in high complexity areas.
Removed total of 0 edges in high density areas.
Removed an additional of 0 edges with low support from full graph G_prime of all contigs.
Number of edges in G_prime (after removing edges under -e threshold (if not specified, default is -e 3): 5008


Nr of contigs/scaffolds included in this pass: 5008
Out of which 60 acts as border contigs.
Total time for CreateGraph-module, iteration 3: 16.9412498474

0 link edges created.
Perform inference on scaffold graph...
Remove isolated nodes.
0 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core
iterating until maximum of 0 extensions.
Number of nodes:10016, Number of edges: 5008
Elapsed time single core pathfinder: 0.0114350318909
0 paths detected are with score greater or equal to 1.5
Nr of contigs left: 5008.0 Nr of linking edges left: 0.0
Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0
Time elapsed for making scaffolds, iteration 3: 0.63897395134

(super)Contigs after scaffolding: 5008

param value
detect_haplotype False
hit_path_threshold False
lognormal False
orientation rf
gap_estimations []
hapl_threshold 3
gff_file None
lower_cov_cutoff 0
path_gaps_estimated 0
expected_links_over_mean_plus_stddev 5
read_len 152.81
pass_number 4
path_threshold 100000
std_dev_coverage 0.00381765871116
mean_coverage 0.00143064790959
detect_duplicate True
FASTER_ILP False
development False
std_dev_ins_size 5149.0
NO_ILP False
current_N50 0
print_scores False
mean_ins_size 11643.0
multiprocess False
scaffold_indexer 48716
hapl_ratio 1.3
no_score True
first_lib False
current_L50 0
plots False
contigfile None
cov_cutoff None
contamination_ratio False
ins_size_threshold 42537.0
edgesupport 5
extend_paths True
tot_assembly_length 1505017490
max_extensions None
score_cutoff 1.5
min_mapq 20
information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300>
contamination_mean 0
max_contig_overlap 200
contig_threshold 6958
contamination_stddev 0
dfs_traversal True

PASS 5

-T 333115.0 -t 259485.0

LIBRARY STATISTICS
Mean of library set to: 112225.0
Standard deviation of library set to: 36815.0
MP library PE contamination:
Contamine rate (rev comp oriented) estimated to: False
lib contamine mean (avg fragmentation size): 0
lib contamine stddev: 0
Number of contamined reads used for this calculation: 0.0
-T (library insert size threshold) set to: 333115.0
-k set to (Scaffolding with contigs larger than): 259485.0
Number of links required to create an edge: None
Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200
Read length set to: 719.6

Time elapsed for getting libmetrics, iteration 4: 0.472044944763

Parsing BAM file...
L50: 0 N50: 0 Initial contig assembly length: 1505017490
Nr of contigs/scaffolds that was singeled out due to length constraints 60
Time cleaning BESST objects for next library: 0.00434803962708
Total time for CreateGraph-module, iteration 4: 0.00948882102966

0 link edges created.
Perform inference on scaffold graph...
Remove isolated nodes.
0 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core
iterating until maximum of 0 extensions.
Number of nodes:0, Number of edges: 0
Elapsed time single core pathfinder: 4.2200088501e-05
0 paths detected are with score greater or equal to 1.5
Nr of contigs left: 0.0 Nr of linking edges left: 0.0
Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0
Time elapsed for making scaffolds, iteration 4: 5.06741690636

(super)Contigs after scaffolding: 5008

param value
detect_haplotype False
hit_path_threshold False
lognormal False
orientation fr
gap_estimations []
hapl_threshold 3
gff_file None
lower_cov_cutoff 0
path_gaps_estimated 0
expected_links_over_mean_plus_stddev 5
read_len 719.6
pass_number 5
path_threshold 100000
std_dev_coverage 0.00381765871116
mean_coverage 0.00143064790959
detect_duplicate True
FASTER_ILP False
development False
std_dev_ins_size 36815.0
NO_ILP False
current_N50 0
print_scores False
mean_ins_size 112225.0
multiprocess False
scaffold_indexer 48716
hapl_ratio 1.3
no_score True
first_lib False
current_L50 0
plots False
contigfile None
cov_cutoff None
contamination_ratio False
ins_size_threshold 333115.0
edgesupport None
extend_paths True
tot_assembly_length 1505017490
max_extensions None
score_cutoff 1.5
min_mapq 20
information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300>
contamination_mean 0
max_contig_overlap 200
contig_threshold 6958
contamination_stddev 0
dfs_traversal True

L50: 0 N50: 0 Initial contig assembly length: 1505017490
Total time for scaffolding: 7012.52787113
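
The PASS 1 counts above can be cross-checked with a little arithmetic. The sketch below is only an editorial aid and is not part of BESST; the variable names are made up, and every number is copied from the log as posted.

```python
# Figures copied from PASS 1 of the Statistics.txt above; names are illustrative.
initial_contigs = 48716
discarded_by_filter_contigs = 1
removed_as_repeats = 43707
useful_reads = 338778484
too_large_insert = 304923304
edges_g_before, edges_g_after = 858809, 1503

# Contigs that survive the length filter and the repeat (coverage) filter.
considered = initial_contigs - discarded_by_filter_contigs
remaining_contigs = considered - removed_as_repeats

# Share of contigs classified as repeats, share of "useful" reads dropped
# for too large an insert size, and share of edges in G that survive.
repeat_fraction = removed_as_repeats / considered
insert_filtered_fraction = too_large_insert / useful_reads
edge_survival = edges_g_after / edges_g_before

print("contigs left after repeat removal:", remaining_contigs)
print("fraction of contigs classified as repeats: %.3f" % repeat_fraction)
print("fraction of useful reads over the insert size threshold: %.3f" % insert_filtered_fraction)
print("fraction of edges in G kept: %.4f" % edge_survival)
```

The remainder equals the 5008 contigs that the log reports as included in the pass, which suggests that the repeat filter, rather than the read filters alone, accounts for most of the lost contigs.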

Owner

ksahlin commented Jul 6, 2020

Hi Ray,

It looks like all contigs are being filtered out because of highly variable coverage (or because the algorithm BESST uses to infer the mode of the coverage distribution is unstable on this data). To fix this, set -z 10000 10000 10000 10000 10000. This effectively turns off the filtering of contigs with very high coverage for all 5 libraries (set the values as you like; 10000 is just an example).

Let me know if this works.

Best,
K
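
For readers who script their runs, here is a minimal sketch of how the suggested -z values could be passed, one per library. It only assembles and prints a command line. The file names and output directory are hypothetical, and the -c, -f, -o and -orientation flags are assumed from BESST's usual usage, so check them against `runBESST --help` before relying on this. The orientations mirror the five passes in the log (rf for the first four, fr for the last).

```python
import shlex

def besst_command(contigs, bams, orientations, out_dir, cov_cutoffs):
    """Assemble a runBESST argument list with one -z value per library."""
    if not (len(bams) == len(orientations) == len(cov_cutoffs)):
        raise ValueError("need one orientation and one -z value per BAM file")
    cmd = ["runBESST", "-c", contigs, "-f"] + list(bams)
    cmd += ["-o", out_dir, "-orientation"] + list(orientations)
    cmd += ["-z"] + [str(z) for z in cov_cutoffs]
    return cmd

# Hypothetical inputs: five BAM files, one per library.
bams = ["lib%d.bam" % i for i in range(1, 6)]
cmd = besst_command("contigs.fa", bams, ["rf", "rf", "rf", "rf", "fr"],
                    "scaffold", [10000] * 5)

# Print a shell-safe version of the command instead of running it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the list first and checking that the lengths match guards against passing fewer -z values than libraries.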

Author

melop commented Jul 7, 2020

Thank you for the quick reply!
Could it have something to do with my setting min-mapq to 20? Should I set it to 0 instead, so that reads mapped to repetitive regions are also considered?

Ray
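
To make the effect of a mapping-quality threshold concrete, here is a toy sketch. It is not BESST's code. It assumes that a read pair is used only when both mates reach the threshold, and the MAPQ values are made up for illustration.

```python
def pairs_passing(pairs, min_mapq):
    """Return the read pairs in which both mates have MAPQ >= min_mapq."""
    return [(a, b) for a, b in pairs if a >= min_mapq and b >= min_mapq]

# Made-up (mate 1 MAPQ, mate 2 MAPQ) values, for illustration only.
toy_pairs = [(60, 60), (60, 0), (3, 17), (25, 20), (0, 0)]

kept_at_20 = pairs_passing(toy_pairs, 20)
kept_at_0 = pairs_passing(toy_pairs, 0)

print(len(kept_at_20), "of", len(toy_pairs), "pairs kept at threshold 20")
print(len(kept_at_0), "of", len(toy_pairs), "pairs kept at threshold 0")
```

A threshold of 0 keeps every pair, including those where a mate maps equally well to several places, so it trades more links for a higher risk of spurious ones in repeats.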

Owner

ksahlin commented Jul 8, 2020

Sure, that might be a good idea to try! Let me know how it goes.
