Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dorado demux unexplained unclassified read #1124

Open
GeorgescuC opened this issue Nov 8, 2024 · 6 comments
Open

Dorado demux unexplained unclassified read #1124

GeorgescuC opened this issue Nov 8, 2024 · 6 comments
Labels
barcode Issues related to barcoding

Comments

@GeorgescuC
Copy link

Issue Report

Please describe the issue:

Hi, I was testing different configurations to demux non-ONT barcodes in order to find optimal scoring and window sizes settings, and found inconsistencies in the number of classified/unclassified reads between kit configurations, so I started looking at some examples and found cases that I cannot explain.
The reads can have the barcode I am looking for on either side but they are the same sequences so the adapter and barcode sequences are the same. The barcodes are 7 bp long with >=3 edit distance and after noticing some reads where the barcode was slightly further inside the sequence than the 175bp rear_barcode_window I used at first, I tried increasing that to 250, which made some of the previously classified reads to become unclassified. In the example read used further down, the dorado verbose log indicates that this is due to both ends confidently predicting different barcodes, but the penalties for the barcodes given are 1 for the top strand vs 10 for the bottom strand while my min_barcode_penalty_dist is only 3.

Steps to reproduce the issue:

Run the dorado demux commands provided below with the files attached.

Run environment:

  • Dorado version: 0.8.2
  • Dorado command:
    • dorado-0.8.2-linux-x64/bin/dorado demux -vv --output-dir debug_new_unclassified_r250 debug_new_unclassified_r250.sam --barcode-arrangement kit_adapter_config_r250.toml --barcode-sequences kit_barcodes.fasta --no-trim
    • dorado-0.8.2-linux-x64/bin/dorado demux -vv --output-dir debug_new_unclassified debug_new_unclassified_r250.sam --barcode-arrangement kit_adapter_config.toml --barcode-sequences kit_barcodes.fasta --no-trim
  • Operating system: Ubuntu
  • Hardware (CPUs, Memory, GPUs): AMD Ryzen 9 7950X, 64GB RAM, NVIDIA RTX4090
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): SAM
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): 1 already basecalled read
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue): the attached zip archive contains a 1 read SAM file, and the kit configuration files used.
    dorado_issue.zip

Logs

  • Log for the configuration that does not work as expected:
dorado-0.8.2-linux-x64/bin/dorado
[2024-11-07 17:29:08.380] [info] Running: "demux" "-vvv" "--output-dir" "debug_new_unclassified_r250" "debug_new_unclassified_r250.sam" "--barcode-arrangement" "kit_adapter_config_r250.toml" "--barcode-sequences" "kit_barcodes.fasta" "--no-trim"
[2024-11-07 17:29:08.380] [trace] AlignmentProcessingItems::initialise [ENTRY]
[2024-11-07 17:29:08.380] [trace] AlignmentProcessingItems::initialise [EXIT]
[2024-11-07 17:29:08.380] [info] num input files: 1
[2024-11-07 17:29:08.380] [debug] > barcoding threads 29, writer threads 3
[2024-11-07 17:29:08.416] [info] > starting barcode demuxing
[2024-11-07 17:29:08.416] [debug] Total reads processed: 1
[2024-11-07 17:29:08.416] [trace] pushed to pipeline: 1
[2024-11-07 17:29:08.418] [debug] > Kits to evaluate: 1
[2024-11-07 17:29:08.418] [trace] query cursor 29 target cursor 26
[2024-11-07 17:29:08.418] [trace] midstrand flank top dist 17 position 129 bc_loc 155 score 0.60465115
[2024-11-07 17:29:08.418] [trace]
CTACACGACGCTCTTCCGATCTNNNNNNNGCAATGAAGTCGCAGGGTTGG
| |*||||*| ||**| ||||*|||||||*|*||*||*||* |   ||||
C-AGACGAGG-TCAAC-GATCCCTCCCTTACCATCAAATCA-A---TTGG
[2024-11-07 17:29:08.418] [trace] query cursor 28 target cursor 27
[2024-11-07 17:29:08.418] [trace] midstrand flank bottom dist 16 position 261 bc_loc 288 score 0.627907
[2024-11-07 17:29:08.418] [trace]
CCAACCCTGCGACTTCATT-GCNNNNNNNAGATCG-GAAG-AGCGTCGTGTAG
*|*||| |||||||*| || ||||||||| ||||| |*|| | | ||**| |
GCGACC-TGCGACTCC-TTAGCATTTGAC-GATCGAGTAGTA-C-TCCCG-A-
[2024-11-07 17:29:08.418] [trace] query cursor 29 target cursor 23
[2024-11-07 17:29:08.418] [trace] top score dist 8 position 0 bc_loc 23 score 0.8139535
[2024-11-07 17:29:08.418] [trace]
CTACACGACGCTCTTCCGATCTNNNNNNNGCAATGAAGTCGCAGGGTTGG
  || |*|| ||*  |||||||||||||||||||||||||||||||||||
--AC-CAAC-CTA--CCGATCTACTACACGCAATGAAGTCGCAGGGTTGG
[2024-11-07 17:29:08.418] [trace] query cursor 28 target cursor 24
[2024-11-07 17:29:08.418] [trace] bottom score dist 19 position 29 bc_loc 53 score 0.55813956
[2024-11-07 17:29:08.418] [trace]
CCAACCCTGCGACTT-CATTGCNNNNNNNAGATCGGAAGAGCGTCGTGTAG
||||*||  || ||| ||*  |||||||| ||*|||  | | ||* |  |
CCAAACC--CG-CTTTCAC--CGCTACAC-GACCGG--G-G-GTA-T--A-
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc01
[2024-11-07 17:29:08.418] [trace] top window 1
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTACTACACGCAATGA
||*|||||||||||||||||||||
CTACCGATCTACTACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 13
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCGTGTAGTAGATCGG-
* ||| ||**|| |**|*  ||*|||
CGCTTTCACCGC-TACAC--GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc02
[2024-11-07 17:29:08.418] [trace] top window 4
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTACTAGTA-GCAATGA
||*|||||||||||* | |||||||
CTACCGATCTACTAC-ACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 11
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCTACTAGTAGATCGG-
* ||| ||**||||| |*  ||*|||
CGCTTTCACCGCTAC-AC--GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc03
[2024-11-07 17:29:08.418] [trace] top window 4
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTAGTGTACGCAATGA
||*||||||||*|**|||||||||
CTACCGATCTACTACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 10
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCGTACACTAGATCGG-
* ||| ||**|| |||||  ||*|||
CGCTTTCACCGC-TACAC--GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc04
[2024-11-07 17:29:08.418] [trace] top window 5
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTA-T-CACTAGCAATGA
||*|||||||| | |||  |||||||
CTACCGATCTACTACAC--GCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 12
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCTAGTGATAGATCGG-
* ||| ||**||||*  |* ||*|||
CGCTTTCACCGCTAC--AC-GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc05
[2024-11-07 17:29:08.418] [trace] top window 7
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCT-C-AGCTGTGCAATGA
||*||||||| | | |** |||||||
CTACCGATCTACTA-CAC-GCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 11
[2024-11-07 17:29:08.418] [trace]
ACTTCATTGCACAGCTG-A-GATCGG-
 |* |*|| |||*|||* | ||*|||
-CG-CTTT-CACCGCTACACGACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc06
[2024-11-07 17:29:08.418] [trace] top window 4
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTCAGT-CACGCAATGA
||*||||||| |*| ||||||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 12
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCGTGACTGAGATCGG-
* ||| ||**|| | ||** ||*|||
CGCTTTCACCGC-T-ACAC-GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc07
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCT-CATGTATGCAATGA
||*||||||| | |**|*|||||||
CTACCGATCTAC-TACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 11
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCATACATGAGATCGG-
* ||| ||**|| ||||*||*  |||
CGCTTTCACCGC-TACACGAC--CGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc08
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCT-CGTATGTGCAATGA
||*||||||| | ||***|||||||
CTACCGATCTAC-TACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 12
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGC-ACATACGAGATCGG-
* ||| ||**|| |||  |||*  |||
CGCTTTCACCGCTACA--CGAC--CGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc09
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTGAC-ATGTGCAATGA
||*||||||| || |***|||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 12
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGC-ACATGTCAGATCGG-
* ||| ||**|| |||   | ||*|||
CGCTTTCACCGCTACA---C-GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc10
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTGAGT-CTA-GCAATGA
||*||||||| |*| | | |||||||
CTACCGATCT-ACTAC-ACGCAATGA
[2024-11-07 17:29:08.418] [trace] bottom window 11
[2024-11-07 17:29:08.418] [trace]
A-CTT-CATTGCTAGACTCAGATCGG-
* ||| ||**||||*||   ||*|||
CGCTTTCACCGCTACAC---GACCGGG
[2024-11-07 17:29:08.418] [trace] Checking barcode kit_bc11
[2024-11-07 17:29:08.418] [trace] top window 6
[2024-11-07 17:29:08.418] [trace]
CTTCCGATCTG-TAGATAGCAATGA
||*|||||||* ||*|* |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 10
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCTATCTACAGATCGG-
* ||| ||**|||| | || ||*|||
CGCTTTCACCGCTA-C-AC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc12
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TATGACGCAATGA
||*|||||||* ||* |||||||||
CTACCGATCTACTAC-ACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGTCATACAGATCGG-
* ||| ||**|| | |*|| ||*|||
CGCTTTCACCGC-T-ACAC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc13
[2024-11-07 17:29:08.419] [trace] top window 7
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TGTAGTGCAATGA
||*|||||||* |**|* |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 10
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCACTACACAGATCGG-
* ||| ||**||  ||||| ||*|||
CGCTTTCACCGC--TACAC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc14
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTTACTAGT-GCAATGA
||*||||||| ||||** |||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
ACTTCATTGCACTAG-TA-A-GATCGG-
 |* |*|| |||* | || | ||*|||
-CG-CTTT-CACC-GCTACACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc15
[2024-11-07 17:29:08.419] [trace] top window 4
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TACACTGCAATGA
||*|||||||* ||||| |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
ACTTCATTGCAGTG-TACA-GATCGG-
 |* |*|| ||**| |||| ||*|||
-CG-CTTT-CACCGCTACACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc16
[2024-11-07 17:29:08.419] [trace] top window 6
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTTAGTG-ATGCAATGA
||*||||||| |*|* |*|||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
ACTTCATTGCATCACTA-A-GATCGG-
 |* |*|| ||*|*||| | ||*|||
-CG-CTTT-CACCGCTACACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc17
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTAC-AGCTGGCAATGA
||*||||||||| | |**|||||||
CTACCGATCTACTA-CACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCCAGCTGTA-GATCGG-
* ||| ||   || |||**| ||*|||
CGCTTTCA---CC-GCTACACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc18
[2024-11-07 17:29:08.419] [trace] top window 6
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TGACTGGCAATGA
||*|||||||* | ||**|||||||
CTACCGATCTACT-ACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCCAGTCACAGATCGG-
* ||| ||**||*|  ||| ||*|||
CGCTTTCACCGCTA--CAC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc19
[2024-11-07 17:29:08.419] [trace] top window 4
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTA-TACATGGCAATGA
||*|||||||| ||||*| ||||||
CTACCGATCTACTACACG-CAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 13
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCCATGTATAGATCGG-
* ||| ||**||*|*  |* ||*|||
CGCTTTCACCGCTAC--AC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc20
[2024-11-07 17:29:08.419] [trace] top window 4
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTAC-ATACGGCAATGA
||*||||||||| |*||| ||||||
CTACCGATCTACTACACG-CAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 13
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCCGTATGTAGATCGG-
* ||| ||**||  ||*** ||*|||
CGCTTTCACCGC--TACAC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc21
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTAC-ATGTCGCAATGA
||*||||||||| |** ||||||||
CTACCGATCTACTACA-CGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGACATGTAGATCGG-
* ||| ||**||*|||*| |*  |||
CGCTTTCACCGCTACACG-AC--CGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc22
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTTAG-ACTCGCAATGA
||*||||||| |* ||*||||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGAGTCTAAGATCGG-
* ||| ||**||*|  | |*||*|||
CGCTTTCACCGCTA--C-ACGACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc23
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTTATCTAC--GCAATGA
||*||||||| | ||||  |||||||
CTACCGATCT-A-CTACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 12
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGTAGATAAGATCGG-
* ||| ||**|| ||*|*  ||*|||
CGCTTTCACCGC-TACAC--GACCGGG
[2024-11-07 17:29:08.419] [trace] Checking barcode kit_bc24
[2024-11-07 17:29:08.419] [trace] top window 5
[2024-11-07 17:29:08.419] [trace]
CTTCCGATCTG-TCATACGCAATGA
||*|||||||* | |*|||||||||
CTACCGATCTACT-ACACGCAATGA
[2024-11-07 17:29:08.419] [trace] bottom window 11
[2024-11-07 17:29:08.419] [trace]
A-CTT-CATTGCGTATGACAGATCGG-
* ||| ||**|| ||* || ||*|||
CGCTTTCACCGC-TAC-AC-GACCGGG
[2024-11-07 17:29:08.419] [trace] Two ends confidently predict different BCs: top bc kit_bc01, bottom bc kit_bc03
[2024-11-07 17:29:08.419] [trace] BC: unclassified
[2024-11-07 17:29:08.419] [trace] Barcode for 00016867-4590-42ec-b358-e4cf7ee81f13 is unclassified
[2024-11-07 17:29:08.529] [info] > Simplex reads basecalled: 1
[2024-11-07 17:29:08.529] [info] > 1 reads demuxed @ classifications/s: 1.000000e+01
[2024-11-07 17:29:08.529] [debug] Barcode distribution :
[2024-11-07 17:29:08.529] [debug] unclassified : 1
[2024-11-07 17:29:08.529] [debug] Classified rate 0%
[2024-11-07 17:29:08.529] [info] > finished barcode demuxing
  • Log for the configuration that works as expected:
dorado-0.8.2-linux-x64/bin/dorado
[2024-11-07 16:50:39.228] [info] Running: "demux" "-vv" "--output-dir" "debug_new_unclassified" "debug_new_unclassified_r250.sam" "--barcode-arrangement" "kit_adapter_config.toml" "--barcode-sequences" "kit_barcodes.fasta" "--no-trim"
[2024-11-07 16:50:39.228] [trace] AlignmentProcessingItems::initialise [ENTRY]
[2024-11-07 16:50:39.228] [trace] AlignmentProcessingItems::initialise [EXIT]
[2024-11-07 16:50:39.228] [info] num input files: 1
[2024-11-07 16:50:39.228] [debug] > barcoding threads 29, writer threads 3
[2024-11-07 16:50:39.259] [info] > starting barcode demuxing
[2024-11-07 16:50:39.259] [debug] Total reads processed: 1
[2024-11-07 16:50:39.259] [trace] pushed to pipeline: 1
[2024-11-07 16:50:39.261] [debug] > Kits to evaluate: 1
[2024-11-07 16:50:39.261] [trace] query cursor 29 target cursor 26
[2024-11-07 16:50:39.261] [trace] midstrand flank top dist 17 position 129 bc_loc 155 score 0.60465115
[2024-11-07 16:50:39.261] [trace]
CTACACGACGCTCTTCCGATCTNNNNNNNGCAATGAAGTCGCAGGGTTGG
| |*||||*| ||**| ||||*|||||||*|*||*||*||* |   ||||
C-AGACGAGG-TCAAC-GATCCCTCCCTTACCATCAAATCA-A---TTGG
[2024-11-07 16:50:39.261] [trace] query cursor 28 target cursor 27
[2024-11-07 16:50:39.261] [trace] midstrand flank bottom dist 16 position 261 bc_loc 288 score 0.627907
[2024-11-07 16:50:39.261] [trace]
CCAACCCTGCGACTTCATT-GCNNNNNNNAGATCG-GAAG-AGCGTCGTGTAG
*|*||| |||||||*| || ||||||||| ||||| |*|| | | ||**| |
GCGACC-TGCGACTCC-TTAGCATTTGAC-GATCGAGTAGTA-C-TCCCG-A-
[2024-11-07 16:50:39.261] [trace] query cursor 29 target cursor 23
[2024-11-07 16:50:39.261] [trace] top score dist 8 position 0 bc_loc 23 score 0.8139535
[2024-11-07 16:50:39.261] [trace]
CTACACGACGCTCTTCCGATCTNNNNNNNGCAATGAAGTCGCAGGGTTGG
  || |*|| ||*  |||||||||||||||||||||||||||||||||||
--AC-CAAC-CTA--CCGATCTACTACACGCAATGAAGTCGCAGGGTTGG
[2024-11-07 16:50:39.261] [trace] query cursor 28 target cursor 28
[2024-11-07 16:50:39.261] [trace] bottom score dist 19 position 38 bc_loc 66 score 0.55813956
[2024-11-07 16:50:39.261] [trace]
CCAAC--CCTGCGACTTCATTGCNNNNNNNAGATCG--GAAG-AGCGTCGTGTAG
|||*|  |||* ||*||*||| ||||||||| |||*  |||* ||*|*|* |||
CCATCGTCCTA-GAATTAATT-CCCCTAAAA-ATCTTTGAAATAGGGCCC-GTA-
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc01
[2024-11-07 16:50:39.261] [trace] top window 1
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTACTACACGCAATGA
||*|||||||||||||||||||||
CTACCGATCTACTACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCGTGTAGTAGATCGG-
 |*||*|||*|** ||**| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc02
[2024-11-07 16:50:39.261] [trace] top window 4
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTACTAGTA-GCAATGA
||*|||||||||||* | |||||||
CTACCGATCTACTAC-ACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 12
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCTACTAGTAGATCGG-
 |*||*|||*|* |||**| |||**
GAATTAATTCCC-CTAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc03
[2024-11-07 16:50:39.261] [trace] top window 4
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTAGTGTACGCAATGA
||*||||||||*|**|||||||||
CTACCGATCTACTACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCG-TACACTAGATCGG-
 |*||*|||*|* ||*|  | |||**
GAATTAATTCCCCTAAA--A-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc04
[2024-11-07 16:50:39.261] [trace] top window 5
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTA-T-CACTAGCAATGA
||*|||||||| | |||  |||||||
CTACCGATCTACTACAC--GCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCTAGTGATAGATCGG-
 |*||*|||*|** |*|*| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc05
[2024-11-07 16:50:39.261] [trace] top window 7
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCT-C-AGCTGTGCAATGA
||*||||||| | | |** |||||||
CTACCGATCTACTA-CAC-GCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCACAGCTGAGA-TCGG-
 |*||*|||*| |  ||*|*| ||**
GAATTAATTCC-C--CTAAAAATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc06
[2024-11-07 16:50:39.261] [trace] top window 4
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTCAGT-CACGCAATGA
||*||||||| |*| ||||||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 14
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCGTGACTGAGA-TCGG-
 |*||*|||*|*   ||*|*| ||**
GAATTAATTCCC---CTAAAAATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc07
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCT-CATGTATGCAATGA
||*||||||| | |**|*|||||||
CTACCGATCTAC-TACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCA-TACATGAGATCGG-
 |*||*|||*|* ||*|  | |||**
GAATTAATTCCCCTAAA--A-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc08
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCT-CGTATGTGCAATGA
||*||||||| | ||***|||||||
CTACCGATCTAC-TACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 12
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCACATACGAGATCGG-
 |*||*|||*|*| ||**| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc09
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTGAC-ATGTGCAATGA
||*||||||| || |***|||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 13
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCACATGTCAGATCGG-
 |*||*|||*|*| |***| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc10
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTGAGT-CTA-GCAATGA
||*||||||| |*| | | |||||||
CTACCGATCT-ACTAC-ACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 14
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCTAGACTCAGA-TCGG-
 |*||*|||*|*   ||*|*| ||**
GAATTAATTCCC---CTAAAAATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc11
[2024-11-07 16:50:39.261] [trace] top window 6
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTG-TAGATAGCAATGA
||*|||||||* ||*|* |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 12
[2024-11-07 16:50:39.261] [trace]
-ACTTCATTGCTATCTACAGATCGG-
 |*||*|||*|*  |||*|*|||**
GAATTAATTCCC--CTAAAAATCTTT
[2024-11-07 16:50:39.261] [trace] Checking barcode kit_bc12
[2024-11-07 16:50:39.261] [trace] top window 5
[2024-11-07 16:50:39.261] [trace]
CTTCCGATCTG-TATGACGCAATGA
||*|||||||* ||* |||||||||
CTACCGATCTACTAC-ACGCAATGA
[2024-11-07 16:50:39.261] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCGTCATACAGATCGG-
 |*||*|||*|* | ||*|*|||**
GAATTAATTCCC-C-TAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc13
[2024-11-07 16:50:39.262] [trace] top window 7
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTG-TGTAGTGCAATGA
||*|||||||* |**|* |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 11
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCACTACACAGATCGG-
 |*||*|||*|*|||*| | |||**
GAATTAATTCCCCTAAA-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc14
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTTACTAGT-GCAATGA
||*||||||| ||||** |||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 11
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCACTAGTAAGATCGG-
 |*||*|||*|*|||* || |||**
GAATTAATTCCCCTAA-AA-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc15
[2024-11-07 16:50:39.262] [trace] top window 4
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTG-TACACTGCAATGA
||*|||||||* ||||| |||||||
CTACCGATCTACTACAC-GCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 13
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCAGTGTACAGATCGG-
 |*||*|||*|**|**| | |||**
GAATTAATTCCCCTAAA-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc16
[2024-11-07 16:50:39.262] [trace] top window 6
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTTAGTG-ATGCAATGA
||*||||||| |*|* |*|||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCATCACTAAGA-TCGG-
 |*||*|||*|  | ||||*| ||**
GAATTAATTCC--C-CTAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc17
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTAC-AGCTGGCAATGA
||*||||||||| | |**|||||||
CTACCGATCTACTA-CACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCCAGCTGTAGATCGG-
 |*||*|||*||  ||**|*|||**
GAATTAATTCCC--CTAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc18
[2024-11-07 16:50:39.262] [trace] top window 6
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTG-TGACTGGCAATGA
||*|||||||* | ||**|||||||
CTACCGATCTACT-ACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCCAGTCACAGATCGG-
 |*||*|||*||* |*|*| |||**
GAATTAATTCCCC-TAAAA-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc19
[2024-11-07 16:50:39.262] [trace] top window 4
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTA-TACATGGCAATGA
||*|||||||| ||||*| ||||||
CTACCGATCTACTACACG-CAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCCATGTATAGATCGG-
 |*||*|||*||*|**| | |||**
GAATTAATTCCCCTAAA-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc20
[2024-11-07 16:50:39.262] [trace] top window 4
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTAC-ATACGGCAATGA
||*||||||||| |*||| ||||||
CTACCGATCTACTACACG-CAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCCGTATGTAGATCGG-
 |*||*|||*||*||** | |||**
GAATTAATTCCCCTAAA-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc21
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTAC-ATGTCGCAATGA
||*||||||||| |** ||||||||
CTACCGATCTACTACA-CGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 13
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCGACATGTAGATCGG-
 |*||*|||*|* | |**|*|||**
GAATTAATTCCC-C-TAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc22
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTTAG-ACTCGCAATGA
||*||||||| |* ||*||||||||
CTACCGATCT-ACTACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 13
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCGAGTCTAAGA-TCGG-
 |*||*|||*|*   ||||*| ||**
GAATTAATTCCC---CTAAAAATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc23
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTTATCTAC--GCAATGA
||*||||||| | ||||  |||||||
CTACCGATCT-A-CTACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 12
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCG-TAGATAAGATCGG-
 |*||*|||*|* || | || |||**
GAATTAATTCCCCTA-A-AA-ATCTTT
[2024-11-07 16:50:39.262] [trace] Checking barcode kit_bc24
[2024-11-07 16:50:39.262] [trace] top window 5
[2024-11-07 16:50:39.262] [trace]
CTTCCGATCTG-TCATACGCAATGA
||*|||||||* | |*|||||||||
CTACCGATCTACT-ACACGCAATGA
[2024-11-07 16:50:39.262] [trace] bottom window 13
[2024-11-07 16:50:39.262] [trace]
-ACTTCATTGCG-TATGACAGATCGG-
 |*||*|||*|* ||* | | |||**
GAATTAATTCCCCTAA-A-A-ATCTTT
[2024-11-07 16:50:39.262] [trace] Scores: kit_bc01 1, kit_bc02 4, kit_bc03 4, kit_bc06 4, kit_bc20 4, kit_bc19 4, kit_bc15 4, kit_bc24 5, kit_bc23 5, kit_bc22 5, kit_bc21 5, kit_bc17 5, kit_bc14 5, kit_bc12 5, kit_bc04 5, kit_bc16 6, kit_bc11 6, kit_bc18 6, kit_bc10 6, kit_bc09 6, kit_bc08 6, kit_bc07 6, kit_bc13 7, kit_bc05 7,
[2024-11-07 16:50:39.262] [trace] BC: kit_barcode01
[2024-11-07 16:50:39.262] [trace] Barcode for 00016867-4590-42ec-b358-e4cf7ee81f13 is kit_barcode01
[2024-11-07 16:50:39.372] [info] > Simplex reads basecalled: 1
[2024-11-07 16:50:39.372] [info] > 1 reads demuxed @ classifications/s: 1.000000e+01
[2024-11-07 16:50:39.372] [debug] Barcode distribution :
[2024-11-07 16:50:39.372] [debug] kit_barcode01 : 1
[2024-11-07 16:50:39.372] [debug] Classified rate 100%
[2024-11-07 16:50:39.372] [info] > finished barcode demuxing
@malton-ont
Copy link
Collaborator

Hi @GeorgescuC,

I can see from the code that we look for the best top and bottom barcode results, which in the first case should be barcodes 1 (top penalty=1) and 3 (bottom penalty = 10). But then we compare the best overall penalties of those two barcodes - this is still the top penalty for barcode 1, but it's also the top penalty (=4) for barcode 3. So our best bottom results is actually pretty confidently predicting barcode3, but for the top rather than the bottom. What I'm not understanding is why this then fails the check, as you say your max_barcode_penalty is 3.

This all works out ok in the passing case as the best bottom penalty is 13 in both barcode 1 and barcode 3, so our std::min_element call finds the first value which is barcode 1 - the same as the best top penalty and everything continues fine.

The general issue is that the bottom barcodes are generating such terrible scores - I think there's a tacit assumption that the best bottom penalty barcode will have a better bottom penalty than it does a top penalty, but that's not the case here. Can I check that you've reverse-complemented the sequences for mask2_front and mask2_rear in the arrangement file? See the docs here.

If that's not the issue, then something looks odd. Are you able to provide us with your barcode arrangement, sequences and the sample read?

@malton-ont malton-ont added the barcode Issues related to barcoding label Nov 11, 2024
@GeorgescuC
Copy link
Author

Hi @malton-ont ,

Thank you for the quick reply. I have attached an archive with the kit description (toml and fasta files) and example read in my original post at the end of the run environment section, but here is a recap of the masks used:

mask1_front = "CTACACGACGCTCTTCCGATCT"
mask1_rear = "GCAATGAAGTCGCAGGGTTGG"
mask2_front = "CTACACGACGCTCTTCCGATCT"
mask2_rear = "GCAATGAAGTCGCAGGGTTGG"

Given that the documentation page you linked to indicates that mask2 sequences are RC when running in double-ended barcode mode and states "For double-ended barcodes which are symmetric, the flank and barcode sequences for front and rear windows are same.", I did not RC them myself. This seems like it is working as I understood it given the trace log that aligns the read to the bottom flank sequence CCAACCCTGCGACTTCATTGCNNNNNNNAGATCGGAAGAGCGTCGTGTAG, which uses RCs of mask2 flanks.

For reference, another of my tests involved running with mask2 sequences being manually RC-ed and again running in double ended barcode mode, but that reduced the classification rate from 65% to 60.5%.
I also tried demuxing using only one sided barcoding on both ends (this time using the RC for the read_only_barcodes=true version), but results also looked off (potentially due to midstrand flanks detection), so I wanted to dig into that once this current issue is resolved as there may be overlap.

For the classification scoring itself, given my understanding of how the scoring was going to work, there are multiple steps where the bottom strand barcode should not have been called as confident. Here is how I expected the algorithm to operate once the flank alignments have been set, and where an issue could be:
look at both strands:

  • top strand:
    • find barcode 01 with 1 edit distance as the best hit
    • keep barcode 01 as a good match as it is within max_barcode_penalty = 6
    • next best barcodes are at 4 edit distance, so they get discarded as they are 3 or more distance further away (min_barcode_penalty_dist)
    • return barcode 01 with 1 penalty score as the proper match
  • bottom strand:
    • find barcodes 03, 11 and 13 with 10 edit distance as the best hits
    • since there is a tie for the best hit, no confident hit can be assigned <---- potential issue here
    • return no barcode as the proper match
  • compare both returned barcodes, since only one side returned a hit, accept that one
  • final classification is barcode 01

Assuming a case where barcodes 11 and 13 were not a tie with barcode 03 on the bottom strand, I would then have assumed this logic (keeping other scores the same):
look at both strand:

  • top strand:
    • find barcode 01 with 1 edit distance as the best hit
    • keep barcode 01 as a good match as it is within max_barcode_penalty = 6
    • next best barcodes are at 4 edit distance, so they get discarded as they are 3 or more distance further away (min_barcode_penalty_dist)
    • return barcode 01 with 1 penalty score as the proper match
  • bottom strand:
    • find barcodes 03 with 10 edit distance as the best hit
    • keep barcode 03 as a low confidence match since its penalty is more than max_barcode_penalty
    • next best barcodes are at 1 edit distance, which is within min_separation_only_dist, so the best match can not be confidently called <---- potential issue here
    • return no barcode as the proper match
  • compare both returned barcodes, since only one side returned a hit, accept that one
  • final classification is barcode 01

I would also further have assumed that since the documentation states For double ended barcodes, the best window (either from the front or rear of the read) is chosen based on the alignment of the flanking mask sequences., if the front barcode is within max_barcode_penalty but the bottom barcode is not, any bottom barcode result would be ignored. This however may be more debatable because it would be hard to find a case where a barcode with more than max_barcode_penalty can still be further away from the next best hit by more than min_separation_only_dist.

Best,
Christophe.

@malton-ont
Copy link
Collaborator

Hi @GeorgescuC,

The read you have provided classifies correctly as barcode01 with that arrangement - I'd be interested in the one you showed above that rejected based on barcode03 having a top edit distance of 4...?

That read also doesn't appear to have a rear barcode at all - are you certain the barcodes are on both ends? Dorado is expecting to look for a sequence like:

5' --- ADAPTER/PRIMER --- MASK1_FRONT --- BARCODE_1 --- MASK1_REAR --- READ --- RC(MASK2_REAR) --- RC(BARCODE_2) --- RC(MASK2_FRONT ) --- 3'

(and the reverse strand)

If this is correct, I suspect that the adapter trimming during basecalling has been over-zealous - you should try rebasecalling with the --no-trim option before demuxing. Another option would be to perform classification during basecalling (by providing the custom kit info to the basecaller command) and then demuxing with --no-classify instead of providing the kit info to the demux command - this should ensure that adapter trimming is performing after the barcodes have been identified. See the barcoding and read trimming documentation.

Your description of the algorithm is not quite correct. For double ended barcodes, dorado:

  • assigns a top and bottom penalty (edit distance) to a result for each barcode
  • finds the result with the best (lowest) top penalty and the result with the best bottom penalty
  • checks if these results have:
    • both confidently predicted a barcode and **
    • have a similar penalty and, if so
    • checks they have predicted the same barcode
  • finds the best two barcodes based on the best penalty of each
  • checks that the best result is close to the end of the read and
    • is good enough and sufficiently better than the second best, or
    • is much better than the second best result
  • finally, if --barcode-both-ends has been set, checks that both the top and bottom penalty of the best result are both good enough

** this is the suspicious part - it doesn't require that the confident prediction is the top or bottom of the respective result so if the best bottom result is bad but has a highly confident (but not the best) top penalty we may still declassify it.

@GeorgescuC
Copy link
Author

Hi @malton-ont ,

The 2 configurations that I provided the run logs for use the same masks and so both cases have barcode03 with a top edit distance of 4, the only difference is the rear_barcode_window size which is 175bp in the case that classifies it properly, and 250bp in the case that does not classify it. In the later case, a very slightly better region is found to align the bottom barcode, but it is still low quality.

Regarding the read structure, the barcode can be present on either end of the read, but will only be present on one end. The molecules here are double stranded cDNA and the barcode being demuxed comes from the original RNA capture (comparable to an UMI in its location, although it is not one). This is a sub-barcode, so the ONT adapter/barcode were added to the cDNA for sequencing and then used for basecalling + demuxing. I then took the uBAM of the ONT barcode for this library and ran demuxing with the provided kit configs.

Basically, these reads can have either of these 2 structures (ignoring the ONT adapters and barcodes trimmed in the first demux round):
5' --- MASK1_FRONT --- BARCODE_1 --- MASK1_REAR --- READ --- 3'
or
5' --- READ --- RC(MASK2_REAR) --- RC(BARCODE_2) --- RC(MASK2_FRONT) --- 3'

I understand that using this config ideally works best when the barcode is on both sides, but given the size of the masks and flanks used, I thought the penalty scores of the side that does not have the barcode should always be so bad that it is never considered as a good enough match, given min_separation_only_dist = 5 while the barcodes are only 3 edit distance away.

I also thought a better approach might be to first look for front barcodes only, then for rear barcodes only among the unclassified reads, as this would also help with reorienting the reads where the barcode was found at the rear. However the tests I ran did not give me confidence in the results.
I tried a few things to check for high-level consistency and false positive calls:

  • front only barcode search with forward masks:
    • 35.478127% classified rate
  • rear only barcode search with forward masks (which should not find anything since for the rear the masks should be RCed):
    • if I leave front_barcode_window = 175, 35.478127% classified rate, identical to above because it still aligns at the front, so it seems like it is completely ignoring rear_only_barcodes = true
    • if I set front_barcode_window = 0, I get a segfault after a couple reads, and those few reads are unclassified because midstrand barcode flanks with score 1
  • rear only barcode search with RC masks:
    • 49.103554% classified rate, which seems too high in comparison with the results looking for front barcodes given that they are likely present in a similar ratio
  • barcode on both sides with both forward masks (case discussed in this issue):
    • 65.043594% classified rate, which is not equal to the sum of classifying either side independently
  • barcode on both sides with front forward masks and rear RCed masks:
    • 60.5823% classified rate, here again the rear masks are incorrect so it should not classify so many more read than just looking for a front barcode alone.

If you think it would be easier or more helpful, I can privately share the whole uBAM used in these tests (1,224,219 reads for slightly below 1GB file size)

Small note on the verbose debugging, it would be nice if the log indicated what read is currently being aligned/scored for all reads and at the beginning of that read's log.

Best,
Christophe

@malton-ont
Copy link
Collaborator

Hi @GeorgescuC,

Basically, these reads can have either of these 2 structures (ignoring the ONT adapters and barcodes trimmed in the first demux round):
5' --- MASK1_FRONT --- BARCODE_1 --- MASK1_REAR --- READ --- 3'
or
5' --- READ --- RC(MASK2_REAR) --- RC(BARCODE_2) --- RC(MASK2_FRONT) --- 3'

Are these both possible template strands? Given your mask1 and mask2 are identical, if the second strand here is meant to represent a complement strand then this should be treated as a single ended barcode - i.e. you should remove mask2 from your config. By providing both sets of masks you are telling dorado that it expects a barcode at both ends, not at either end. If this is genuinely a prep where the barcode can ligate to either end of the template strand, dorado does not support this as a standard configuration.

I'm surprised that running single-ended on the front mask and then single ended with rear_only_barcodes doesn't provide an improvement (that was going to be my suggestion) - this should only search rear_barcode_window bases at the end of the read so if your sequence is longer than that it should never see any front barcodes.

So I would expect these configs to work:

fwd_config
mask1_front = MASK1_FRONT
mask1_rear = MASK2_REAR

Gets you your initial 35%. Then, on the remaining unclassified reads:

rear_config
mask1_front = RC(MASK1_FRONT)
mask2_front = RC(MASK2_REAR)
rear_only_barcodes = true

It sounds like your main issue with this method is that you are getting a higher proportion of reads with the barcode at the rear? I'm not sure I can comment on how likely that is based on your sample prep.
If you raise an issue with our support team (referencing this issue and my name), they can provide you with a link to upload your data and I can take a closer look.

@GeorgescuC
Copy link
Author

Hi @malton-ont ,

Yes I removed mask2 in the instances where searching for only on one side. I also extended the barcodes to have both their normal version and RC version in some of the tests, and that works as expected. The configs I used are:

5'-search forward-masks:

mask1_front = MASK1_FRONT
mask1_rear = MASK1_REAR

barcode1_pattern = BARCODES and RC(BARCODES)

3'-search forward-masks (expecting to not find anything):

mask1_front = MASK1_FRONT
mask1_rear = MASK1_REAR
rear_only_barcodes = true

barcode1_pattern = BARCODES and RC(BARCODES)

3'-search rc-masks:

mask1_front = RC(MASK1_REAR)
mask1_rear = RC(MASK1_FRONT)
rear_only_barcodes = true

barcode1_pattern = RC(BARCODES)

For all these tests, I ran using the full uBAM of the ONT barcode so classification rates are comparable.

It sounds like your main issue with this method is that you are getting a higher proportion of reads with the barcode at the rear?

There seem to be a few issues that may be compounding so I originally made this issue about a single one of them to try to figure out things step by step, but then mentioned the others as well:

  • using a double sided barcode search, in at least one case where a good barcode match is found (1 penalty) at the 5' front, using different window sizes for the 3' rear search where both "best matches" are bad (10 and 11 penalty at best) by more than max_barcode_penalty and with ties between barcodes yields different results: one case properly classifies based on the front barcode found, one case leaves the read unclassified because "ambiguous". Given in both cases the best 3' rear match is just as bad, I expect the read to be classified in both cases to the 5' front match, or to at least be consistent and have the same final result.
  • using rear_only_barcodes = true with a non-zero front_barcode_window and the same forward-masks as the front search gives the same result as doing the default front barcode search, so it seems like rear_only_barcode is ignored.
  • using read_only_barcodes = true with a zero front_barcode_window and the same forward-masks segfaults after a few reads
  • using read_only_barcodes = true with a zero front_barcode_window and the rc-masks runs fine, but the classification rate seems unexpectedly high given the experimental design. I cannot rule out yet that the results are true, but given the 3 issues above, I have not dedicated time to digging into it.

I will contact your support team then. I think it will much easier to figure out what is going wrong (either an error on my side or a bug) with all files in hand. I will include all the configs I used and a short list of commands I ran to quickly reproduce.

Best,
Christophe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
barcode Issues related to barcoding
Projects
None yet
Development

No branches or pull requests

2 participants