HELP WANTED: design advice on sequence content index #18

mlin (Maintainer) · 2021-02-11T08:29:38Z

We'd like to add sequence content indexing, to answer queries for stored DNA/RNA [sub]sequences with similarity to a given one. I'd like advice from the community on what data structures / algos should implement this.

For a table target_table where each row has a column storing DNA/RNA text, after some sort of indexing we can query it like

SELECT target_rowid, target_begin, target_end, query_begin, query_end, hit_score
  FROM genomic_sequence_search('target_table', query_sequence[, tuning_parameters])
  WHERE 100*(query_end-query_begin)/length(query_sequence) >= 50
  ORDER BY hit_score DESC

where query_sequence is a literal DNA/RNA text.
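To make the coverage filter concrete, here is a small sketch that runs the same WHERE and ORDER BY clauses over a mocked result set. The `mock_hits` table and every value in it are invented for illustration; `genomic_sequence_search` itself does not exist yet, so nothing here reflects its eventual behavior.

```python
import sqlite3

# Hypothetical 100-nt query; the hits below are made-up stand-ins for what a
# future genomic_sequence_search() might return.
query_sequence = "ACGT" * 25

con = sqlite3.connect(":memory:")
con.execute(
    """CREATE TABLE mock_hits(
           target_rowid INTEGER, target_begin INTEGER, target_end INTEGER,
           query_begin INTEGER, query_end INTEGER, hit_score REAL)"""
)
con.executemany(
    "INSERT INTO mock_hits VALUES (?,?,?,?,?,?)",
    [
        (1, 500, 590, 5, 95, 170.0),     # aligns 90 of 100 query positions
        (2, 40, 70, 0, 30, 58.0),        # aligns 30 of 100 query positions
        (3, 1000, 1050, 50, 100, 96.0),  # aligns 50 of 100 query positions
    ],
)

# Same predicate as in the proposed query: keep hits covering >= 50% of the query.
rows = con.execute(
    """SELECT target_rowid, hit_score
         FROM mock_hits
        WHERE 100*(query_end-query_begin)/length(?) >= 50
        ORDER BY hit_score DESC""",
    (query_sequence,),
).fetchall()
print(rows)
```

Note that all operands are integers, so SQLite performs integer division here; that is harmless for a `>= 50` threshold but would truncate if the percentage were reported directly.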

Wish list:

Scalable to corpora of 10¹² nucleotides, with query runtime sublinear in database size
Embeds in SQLite's B-tree index and/or "shadow tables"
- In order to inherit disk/RAM paging + basic compression + transactional updates
- Means queries can be executed by some series of point lookups and/or range scans on compound, tuple-ordered keys
- See the docs for SQLite's full-text search module for an example of how more-complex data structures can be spun out of these primitives
- An optional upfront/amortized loading step (such as building a Bloom filter) to accelerate a series of many queries would be OK
Add new sequences w/o full rebuild
Should not use more storage than compressed sequences (i.e. index shouldn't double database file size)
Reasonably understood sensitivity & specificity properties
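As one illustration of "point lookups and/or range scans on compound, tuple-ordered keys", here is a minimal sketch of a naive exact k-mer seed index kept in an ordinary SQLite table. It is not a proposed answer to the question: the `seed_index` table, the `add_sequence` and `search` helpers, and the choice of K = 8 are all assumptions made up for this example. Storing every k-mer would also clearly break the storage wish above; a real design would presumably subsample seeds (minimizers are one well-known approach).

```python
import sqlite3
from collections import Counter

K = 8  # seed length, chosen only for illustration


def kmers(seq, k=K):
    """Yield (offset, k-mer) for every length-k window of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target_table(seq TEXT)")
# Hypothetical shadow table. A WITHOUT ROWID table is stored as one B-tree
# ordered by its compound primary key (kmer, target_rowid, pos).
con.execute(
    """CREATE TABLE seed_index(
           kmer TEXT NOT NULL,
           target_rowid INTEGER NOT NULL,
           pos INTEGER NOT NULL,
           PRIMARY KEY (kmer, target_rowid, pos)
       ) WITHOUT ROWID"""
)


def add_sequence(seq):
    """Insert one sequence plus its seeds in a single transaction (no rebuild)."""
    with con:
        rowid = con.execute(
            "INSERT INTO target_table(seq) VALUES (?)", (seq,)
        ).lastrowid
        con.executemany(
            "INSERT INTO seed_index VALUES (?,?,?)",
            [(km, rowid, i) for i, km in kmers(seq)],
        )
    return rowid


def search(query, min_seeds=3):
    """Count exact seed matches per (target row, diagonal), best first."""
    votes = Counter()
    for qpos, km in kmers(query):
        # Range scan over all index entries sharing the key prefix kmer = ?
        for rowid, tpos in con.execute(
            "SELECT target_rowid, pos FROM seed_index WHERE kmer = ?", (km,)
        ):
            votes[(rowid, tpos - qpos)] += 1
    return [(r, d, n) for (r, d), n in votes.most_common() if n >= min_seeds]


rowid = add_sequence("TTGACCGATTACAGGCATCGATCGGATTACCAGT")
hits = search("GATTACAGGCATCG")
print(hits)  # list of (target_rowid, diagonal, seed_count)
```

Each query costs one range scan per query k-mer, independent of how many sequences are stored apart from seed multiplicity. Exact seeds give no direct handle on sensitivity for divergent sequences, which is part of why this is only a sketch of the access pattern and not a recommendation.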

Non-goals:

Needn't compete on speed with dedicated mappers/overlappers. 10x slower would be fantastic, 100x probably acceptable.
Not research, but engineering: we don't want the bleeding edge that'll surely be superseded in a few months' time, but rather something trustworthy we'll be able to support for 10+ years

JZL · 2023-03-21T17:32:19Z

I haven't tested anything, but I'm getting more and more interested in using SQLite for quick BAM queries, because it could also scale to larger queries using BigQuery or be parallelized per library.

I'm curious about your thoughts on using SQLite's inbuilt Levenshtein distance. I guess it depends on how long the sequence is, and on whether you're matching very small subsequences or the majority of the reads. For single-cell and adjacent applications, it could be nice to match cell barcodes more accurately, since some programs just use simple Hamming distance without handling indels at all. Or maybe a variant of this.
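To illustrate the Hamming vs. Levenshtein point, here is a small sketch with a made-up barcode that has lost one base and picked up the next read base at its end. As far as I know, edit distance in SQLite comes from loadable extensions (spellfix1 provides `editdist3`) rather than the core library, so this example sidesteps that by registering a plain Python function instead. The `barcode_whitelist` table, the `levenshtein` SQL function name, and the barcodes are all hypothetical.

```python
import sqlite3


def hamming(a, b):
    """Number of mismatching positions; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (substitutions + indels)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]


true_barcode = "ACGTACGTAC"
observed = "ACTACGTACG"  # the G at offset 2 was deleted, so everything shifts left

print(hamming(true_barcode, observed))      # large: the shift breaks most positions
print(levenshtein(true_barcode, observed))  # small: one deletion plus one insertion

# Expose the Python function to SQL and rank a toy whitelist by edit distance.
con = sqlite3.connect(":memory:")
con.create_function("levenshtein", 2, levenshtein)
con.execute("CREATE TABLE barcode_whitelist(barcode TEXT PRIMARY KEY)")
con.executemany(
    "INSERT INTO barcode_whitelist VALUES (?)",
    [("ACGTACGTAC",), ("TTTTGGGGCC",)],
)
best = con.execute(
    """SELECT barcode, levenshtein(barcode, ?) AS d
         FROM barcode_whitelist
        ORDER BY d LIMIT 1""",
    (observed,),
).fetchone()
print(best)
```

This evaluates the distance against every whitelist row, so it is linear in table size; it works for short barcodes but does not address the sublinear-query wish for long sequences.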
