Dorado duplex reads' mean quality is lower than that of simplex reads #1067

Open
ErminZ opened this issue Oct 4, 2024 · 3 comments

Comments

@ErminZ

ErminZ commented Oct 4, 2024

Possible reasons why duplex read quality is low

Hello, thank you for improving read quality! We normally observe a higher mean read quality for duplex reads than for simplex reads, especially compared with the simplex reads that were paired into duplexes. Recently, however, one sample showed the opposite, as in the picture below. Could you explain why the duplex reads' quality is lower than that of the simplex reads?

The mean quality of the duplex reads is lower than that of the simplex reads, as shown below:

image

Here are two example duplex reads and their paired simplex reads:

image

Could you explain possible reasons for the low duplex read quality? Please let me know if you have any questions. Thank you!

Sample information:

  1. PCR-targeted sequencing: two amplicons (500bp wild type and 2.2kb with an insertion into the wild type);
  2. Hybridization to enrich only the reads that contain the 1.7kb insertion;
  3. A second PCR to amplify the enriched 2.2kb reads carrying the targeted insertion;
  4. The flow cell was overloaded because the read length was estimated inaccurately before sequencing.

Run environment:

  • Dorado version: V0.7.1 docker genomicpariscentre/dorado:0.7.1

  • Dorado command:
    dorado duplex sup ${pod5_directory} > ${sample_id}.bam --min-qscore 10

    dorado summary ${sample_id}.bam > ${sample_id}_dorado_summary.tsv

  • Operating system: Linux

  • Hardware (CPUs, Memory, GPUs): AWS EC2 p3 8 GPUs instance

  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5

  • Source data location (on device or networked drive - NFS, etc.):

  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
    flow cell:
    image
    kit: SQK-LKS114
    Read length:
    image
    total dataset:

Simplex reads basecalled: 1,396,6178
[info] > Simplex reads filtered: 690,388

[info] > Duplex reads basecalled: 4,882,572
[info] > Duplex reads filtered: 510,211
 Duplex rate: 49.04696%
Basecalled @ Bases/s: 3.975039e+05
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

  • Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
@vellamike
Collaborator

Hello @ErminZ

I’m quite puzzled by the very low accuracy you’re seeing. It’s unexpected, given that duplex reads typically outperform simplex. Are these Q scores predicted or aligned?

That being said, with your approach of sequencing duplex with amplicons, there’s no guarantee that the following strand will be the complement of the first one. Given that duplex relies on correctly pairing template and complement strands, this could lead to the issues you’re observing.

It probably makes more sense to use simplex for your application.

@ErminZ
Author

ErminZ commented Oct 10, 2024

Thank you for the reply! It is very helpful.

So perhaps simplex-only reads are the best choice for PCR-targeted samples in which >80% of reads are duplicates? Could you say more about whether factors other than sequencing location, start/end time, read sequence, and read length influence duplex pairing? Thank you!

The Q scores come from the dorado summary command, so I guess they are predicted?
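
For context on the predicted-versus-aligned question: a predicted Q score is computed from the basecaller's own per-base quality values, without aligning to a reference. A minimal sketch of one common convention follows, averaging in error-probability space instead of averaging the Phred values directly. Whether `dorado summary` computes its mean in exactly this way is an assumption here, not something confirmed in this thread, and the function name `mean_qscore` is illustrative only.

```python
import math


def mean_qscore(qualities):
    """Return the mean Phred score of a read, averaged in error-probability space.

    Assumption: this mirrors a common convention for "predicted" mean Q scores;
    it is not taken from the dorado source code.
    """
    if not qualities:
        return 0.0
    # Convert each Phred score Q to an error probability 10^(-Q/10).
    mean_error = sum(10 ** (-q / 10) for q in qualities) / len(qualities)
    # Convert the mean error probability back to the Phred scale.
    return -10 * math.log10(mean_error)


if __name__ == "__main__":
    # A few low-quality bases pull the mean well below the arithmetic average.
    print(round(mean_qscore([10, 30]), 2))
```

Under this convention a handful of low-confidence bases dominates the mean, so a read with a few poorly resolved positions can score much lower than its typical base quality suggests.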

@malton-ont
Collaborator

@ErminZ,

Descriptions of the pairing criteria are provided here (although there appears to be an error in part 1 - the code actually uses 10000ms, and has an additional constraint that both reads must have a qscore >= 8):

// Determine whether 2 proposed reads form a duplex pair or not.
// The algorithm utilizes the following heuristics to make a decision -
// 1. Reads must be within 1000ms of each other, and the ratio of their
//    lengths must be at least 20%.
// 2. If the lengths are >98% similar, reads are at least 5KB, and time
//    delta is <100ms, consider them to be a pair.
// 3. If the early acceptance fails, then run minimap2 to generate overlap
//    coordinates. If there is only 1 hit from minimap2 mapping,
//    the mapping quality is high (>50), the overlap covers
//    most of the shorter read (80%), the overlap is at least 50 bp long,
//    one read maps to the reverse strand of the other, and the end
//    of the template is mapped to the beginning
//    of the complement read, then consider them a pair.
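
The heuristics above can be sketched in Python to make the decision order easier to follow. This is an illustration under stated assumptions, not dorado's implementation: the `Read` and `Overlap` types, their field names, and `is_duplex_pair` are hypothetical, the time delta is assumed to be measured from the end of the template to the start of the complement, and the minimap2 step is represented by a precomputed `Overlap` record. The 10000ms window and the qscore >= 8 constraint follow the correction noted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    # Hypothetical fields, for illustration only.
    start_ms: int
    end_ms: int
    length: int
    mean_qscore: float


@dataclass
class Overlap:
    # Hypothetical summary of a minimap2 overlap between the two reads.
    num_hits: int
    mapq: int
    overlap_len: int
    reverse_strand: bool
    template_end_to_complement_start: bool


def is_duplex_pair(template: Read, complement: Read,
                   overlap: Optional[Overlap] = None,
                   max_delta_ms: int = 10_000,
                   min_qscore: float = 8.0) -> bool:
    # Step 1: time window, per-read qscore floor, and length ratio.
    # Assumption: delta runs from template end to complement start.
    delta_ms = complement.start_ms - template.end_ms
    if delta_ms < 0 or delta_ms > max_delta_ms:
        return False
    if min(template.mean_qscore, complement.mean_qscore) < min_qscore:
        return False
    shorter = min(template.length, complement.length)
    longer = max(template.length, complement.length)
    if longer == 0 or shorter / longer < 0.2:
        return False

    # Step 2: early acceptance for long, near-identical-length, adjacent reads.
    if shorter / longer > 0.98 and shorter >= 5000 and delta_ms < 100:
        return True

    # Step 3: otherwise fall back to the overlap criteria.
    if overlap is None:
        return False
    return (overlap.num_hits == 1
            and overlap.mapq > 50
            and overlap.overlap_len >= 0.8 * shorter
            and overlap.overlap_len >= 50
            and overlap.reverse_strand
            and overlap.template_end_to_complement_start)
```

With the 500bp and 2.2kb amplicons described in this issue, the 5KB early-acceptance rule in step 2 cannot apply, so under this sketch every candidate pair would be decided by the overlap check in step 3.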
