Scalability to >1M cells #370

Closed
grst opened this issue Oct 9, 2022 · 1 comment

grst commented Oct 9, 2022

Description of feature

I have been playing with omniscope's COVID dataset, which provides 8M T cell receptors (TCRs). In doing so, I identified several bottlenecks that make working with >1M cells in scirpy painful or impossible.

This meta issue gives an overview of the progress on improving scirpy's scalability.

graph TB
    subgraph legend
         legend1(could be faster -- minutes)
         OK(OK -- seconds)
         legend2(prohibitively slow -- hours)
         legend3(not profiled yet)
         style legend1 stroke:#ff7f00
         style OK stroke:#4daf4a
         style legend2 stroke:#e41a1c
    end
graph TB
    subgraph preprocessing
      IO --> index_chains
      index_chains --> QC
      QC --> dist_id[ir_dist identity]
      QC --> dist_levenshtein[ir_dist levenshtein]
      QC --> dist_alignment[ir_dist alignment]
      dist_id --> define_clonotypes
      dist_levenshtein --> define_clonotypes
      dist_alignment --> define_clonotypes
      define_clonotypes --> clonotypes
      QC -.-> autoencoder
      autoencoder -.-> clonotypes
      autoencoder -.-> define_clonotypes

      clonotypes[(CLONOTYPES)]
      
      style IO stroke:#ff7f00
      style index_chains stroke:#ff7f00
      style QC stroke:#4daf4a
      style dist_id stroke:#4daf4a
      style define_clonotypes stroke:#e41a1c
      style dist_levenshtein stroke:#e41a1c
      style dist_alignment stroke:#e41a1c
      style clonotypes stroke:white
   end
   
   subgraph downstream
      clonotypes --> clonotype_network
      clonotypes --> other[other tools]
   end

Action items

  1. data structure (Implement scverse datastucture #356). The foundation for other changes. Might also speed up saving the anndata object.
  2. reading data (Speed up read_airr #367). User experience can be improved, but not a top priority atm.
  3. index_chains (Speed up index_chains #386). Could be faster.
  4. ir_dist (Speed up ir_dist #304). Needs more scalable methods for computing sequence distances.
  5. define_clonotypes (speed up define_clonotypes #368). At the very least needs better parallelization. Maybe there's room for some jax/numba.
  6. autoencoder-based embedding (Autoencoder-based sequence embedding #369). Possible alternative to ir_dist. Maybe it even makes sense to combine ir_dist and define_clonotypes into a single step.
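
To illustrate why sequence distances (item 4) become the hard part at this scale, here is a minimal, self-contained Python sketch. It is not scirpy's implementation. The helper names (`levenshtein`, `n_pairs`, `close_pairs`) and the example CDR3 strings are hypothetical and only serve the illustration. It shows two cheap tricks that reduce the number of comparisons: comparing unique sequences only, and skipping pairs whose length difference already exceeds the cutoff (the edit distance can never be smaller than the length difference).

```python
from collections import Counter
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def n_pairs(n: int) -> int:
    """Number of unordered pairs in an all-vs-all comparison of n items."""
    return n * (n - 1) // 2


def close_pairs(seqs, cutoff=1):
    """Return per-sequence counts and all unique-sequence pairs within `cutoff`.

    Assumption for this sketch: duplicates are collapsed first, so each
    distinct sequence is compared only once, however many cells carry it.
    """
    counts = Counter(seqs)
    unique = sorted(counts)
    hits = []
    for a, b in combinations(unique, 2):
        # Lower bound: distance >= length difference, so skip early.
        if abs(len(a) - len(b)) > cutoff:
            continue
        if levenshtein(a, b) <= cutoff:
            hits.append((a, b))
    return counts, hits


if __name__ == "__main__":
    # Made-up example sequences, not taken from any dataset.
    counts, hits = close_pairs(["CASSL", "CASSL", "CASSF", "CAWSVGQ"])
    print(counts["CASSL"], hits)
    print(n_pairs(8_000_000))
```

Even with these shortcuts the loop is still quadratic in the number of unique sequences. For 8M distinct sequences, `n_pairs(8_000_000)` is roughly 3.2e13 comparisons, which is why an all-vs-all approach does not scale even when each individual comparison is cheap.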

grst commented Nov 24, 2024

The most pressing points here were addressed by @felixpetschko.
Closing the issue. Speeding up read_airr and integration with sequence embeddings #369 are tracked in separate issues.
