I have been playing with omniscope's COVID dataset, which provides 8M TCR receptors. In doing so, I identified several bottlenecks that make working with >1M cells in scirpy painful or impossible.
This meta-issue gives an overview of the progress on improving scirpy's scalability.
```mermaid
graph TB
    subgraph legend
        legend1(could be faster -- minutes)
        OK(OK -- seconds)
        legend2(prohibitively slow -- hours)
        legend3(not profiled yet)
        style legend1 stroke:#ff7f00
        style OK stroke:#4daf4a
        style legend2 stroke:#e41a1c
    end
```
- `ir_dist` (Speed up ir_dist #304): needs more scalable methods for computing sequence distances.
- `define_clonotypes` (speed up define_clonotypes #368): at the very least, needs better parallelization. Maybe there's room for some jax/numba.
- autoencoder-based embedding (Autoencoder-based sequence embedding #369): a possible alternative to `ir_dist`. Maybe it even makes sense to combine `ir_dist` and `define_clonotypes` into a single step.
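To illustrate the `ir_dist` → `define_clonotypes` pipeline the items above refer to, here is a minimal, hedged sketch (not scirpy's actual implementation): build a sparse Hamming-distance network over equal-length CDR3 sequences with a distance cutoff, then derive clonotype clusters as connected components. The O(n²) pairwise loop is exactly the kind of bottleneck that makes >1M cells slow; the sketch only shows the logic.

```python
# Illustrative sketch of an ir_dist-style sequence network followed by a
# define_clonotypes-style clustering step. Function names and the simple
# Hamming metric are assumptions for this example, not scirpy's API.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def hamming_network(seqs, cutoff=1):
    """Sparse adjacency matrix: edge if Hamming distance <= cutoff."""
    # Encode equal-length sequences as a 2D byte array for vectorized comparison.
    arr = np.frombuffer("".join(seqs).encode(), dtype=np.uint8)
    arr = arr.reshape(len(seqs), -1)
    rows, cols = [], []
    for i in range(len(seqs)):
        # Vectorized Hamming distance from sequence i to all later sequences.
        # NOTE: this O(n^2) scan is the scalability bottleneck for large n.
        d = (arr[i] != arr[i + 1:]).sum(axis=1)
        (hits,) = np.nonzero(d <= cutoff)
        rows.extend([i] * len(hits))
        cols.extend(hits + i + 1)
    n = len(seqs)
    data = np.ones(len(rows), dtype=np.int8)
    adj = coo_matrix((data, (rows, cols)), shape=(n, n))
    return adj + adj.T  # symmetrize the upper-triangular edges


seqs = ["CASSLG", "CASSLG", "CASSLV", "CQQQQQ"]
adj = hamming_network(seqs, cutoff=1)
# Clonotype clusters = connected components of the distance network.
n_clusters, labels = connected_components(adj, directed=False)
print(n_clusters, labels.tolist())  # → 2 [0, 0, 0, 1]
```

The two identical sequences and their one-mismatch neighbor collapse into one cluster, while the dissimilar sequence forms its own. Parallelizing or replacing the pairwise scan (e.g. with numba, or an embedding-based lookup) is precisely what the issues above track.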
The most pressing points here were addressed by @felixpetschko.
Closing this issue. Speeding up read_airr and the integration with sequence embeddings (#369) are tracked in separate issues.