Use exllamav2's smart 4-bit KV cache for memory benchmark #185

Interpause · 2024-05-14T07:39:05Z

See: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md

exllamav2 has a 4-bit KV cache whose perplexity is close to that of the unquantized cache in turboderp's testing. As a result, I find that in practice exllamav2 uses less VRAM than llama.cpp for a given context size. I noticed that the exllamav2 benchmark code uses the unquantized cache. Would it be possible to use the 4-bit KV cache for the memory usage benchmark instead? Thanks.

For reference, here's the class to use instead: https://github.com/turboderp/exllamav2/blob/009424a6d42d39efceeecd5562450180bd34a7fb/exllamav2/cache.py#L309
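To give a rough sense of how much the cache type can matter for a memory benchmark, here is a back-of-the-envelope estimate of KV cache size. This is only a sketch: the model shape (32 layers, 32 KV heads, head dimension 128, 4096 tokens) is a hypothetical example, and the group size of 32 with one 16-bit scale per group is an assumption for illustration, not necessarily exllamav2's actual storage layout.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bits_per_value, group_size=None, scale_bits=16):
    """Approximate KV cache size in bytes for a single sequence.

    group_size and scale_bits model the per-group scale overhead of a
    quantized cache; both are illustrative assumptions.
    """
    # Keys and values: one element per layer, KV head, head dim, and token.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    total_bits = n_values * bits_per_value
    if group_size is not None:
        # One scale per group of quantized values.
        total_bits += (n_values // group_size) * scale_bits
    return total_bits // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)

fp16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
q4_bytes = kv_cache_bytes(**shape, bits_per_value=4, group_size=32)

print(f"unquantized (16-bit): {fp16_bytes / 2**20:.0f} MiB")
print(f"4-bit (assumed layout): {q4_bytes / 2**20:.0f} MiB")
print(f"ratio: {fp16_bytes / q4_bytes:.2f}x")
```

Under these assumptions the 4-bit cache comes out at a bit under a third of the unquantized size, which is the kind of gap that would show up in the memory usage numbers. Actual VRAM use also includes weights and activations, so the overall saving is smaller.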

Anindyadeep · 2024-05-15T06:22:00Z

Based on this comment, we could also add ExLlamaV2 for float16. That needs to be checked as well.
