How does sparsity help on GPUs? #1449
Unanswered
Nick-infinity asked this question in Q&A
Replies: 0 comments
I am assuming the CSR format is used to store the sparse tensors.
For a 7B model that is unstructured-pruned to 70% sparsity, the model will have 2.1B non-zero parameters.
The CSR format increases the size of the stored tensor to at least 2.5x the non-zero values alone (non-zero values, column indices, and row pointers).
I.e., the 7B model effectively becomes a 5.25B (2.1 x 2.5) model. The speedups will be small because GPU LLM inference is memory-bound: all 5.25B stored values must be read by the GPU for every generated token.
It would be really helpful if someone could help me understand the DeepSparse inference gains better.
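The storage arithmetic above can be sketched in a few lines. Note that the exact overhead factor depends on the dtypes chosen for values and indices, which the post does not specify; fp16 values and int32 column indices below are illustrative assumptions, and row pointers (one entry per matrix row) are ignored as negligible next to the non-zero count.

```python
# Rough sketch of CSR storage cost for an unstructured
# 70%-pruned 7B-parameter model. Dtype choices are assumptions:
# fp16 (2-byte) values, int32 (4-byte) column indices.
n_params = 7_000_000_000             # dense parameter count
sparsity = 0.70                      # fraction of weights pruned to zero
nnz = int(n_params * (1 - sparsity)) # 2.1B non-zero values

bytes_val = 2   # fp16 weight value
bytes_idx = 4   # int32 column index
# Row pointers add one entry per matrix row, which is tiny
# compared to nnz, so they are omitted here.

dense_bytes = n_params * bytes_val
csr_bytes = nnz * (bytes_val + bytes_idx)

print(f"dense: {dense_bytes / 1e9:.1f} GB")   # 14.0 GB
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")     # 12.6 GB
print(f"CSR / dense: {csr_bytes / dense_bytes:.2f}x")
```

Under these assumed dtypes each stored non-zero costs 6 bytes versus 2 bytes for a dense fp16 weight, i.e. a 3x cost per stored element; the bytes the GPU must read per token shrink only modestly despite 70% of the weights being zero, which is the memory-bound concern raised above.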