
Releases: mobiusml/gemlite

v0.4.1

10 Dec 14:11

Fix bugs related to config caching.

v0.4.0

05 Dec 09:51
8908e50
  • Improved performance on the A100 and H100.
  • Flexible bitpacking support (32-bit / 8-bit, over cols or rows).
  • Best-config caching across all kernels.
  • Helper functions for easier usage.
  • GEMV_SPLITK kernel for better performance at batch-size=1 with non-packed data.
  • Improved accuracy for 8-bit weights with GEMV kernels.
  • Max-autotuning.
  • Avoid out-of-shared-memory by limiting num_stages based on the GPU device.
  • Various bug fixes.

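The flexible bitpacking feature stores several low-bit weight codes inside one wider word (32-bit or 8-bit), laid out along either the rows or the columns of the weight matrix. A minimal pure-Python sketch of the idea for 4-bit values packed into 32-bit words; the function names and layout are illustrative assumptions, not gemlite's actual packing API:

```python
def pack_words(vals, nbits=4, word_bits=32):
    """Pack low-bit integer codes into wider words (illustrative layout:
    value j occupies bits [j*nbits, (j+1)*nbits) of its word)."""
    per_word = word_bits // nbits            # e.g. 8 four-bit values per 32-bit word
    assert len(vals) % per_word == 0
    mask = (1 << nbits) - 1
    words = []
    for i in range(0, len(vals), per_word):
        w = 0
        for j, v in enumerate(vals[i:i + per_word]):
            w |= (v & mask) << (j * nbits)   # place each code in its bit slot
        words.append(w)
    return words

def unpack_words(words, nbits=4, word_bits=32):
    """Invert pack_words: recover the original low-bit codes."""
    per_word = word_bits // nbits
    mask = (1 << nbits) - 1
    return [(w >> (j * nbits)) & mask for w in words for j in range(per_word)]
```

Packing "over cols or rows" then simply decides whether consecutive codes in a word come from the same row or the same column of the weight matrix, which changes the memory-access pattern inside the kernel.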
v0.3.0

28 Oct 16:05
  • New GEMV RevSplitK algorithm outperforms GEMM Split-K and GEMV for batch-size=1
  • Add support for channel-wise scaling (weights, activations, weights + activations)
  • Add support for FP8 x FP8 / FP8 x Wn
  • Add support for INT8 x Wn
  • Improved autotune speed
  • Improved base configs for 4090 RTX, A100 and H100
  • Better control for autotune via set_autotune

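Channel-wise scaling assigns each output channel its own quantization scale instead of one scale for the whole tensor. A minimal sketch of symmetric per-channel weight quantization, under the assumption of one scale per row; gemlite's scheme is more general (weights, activations, or both):

```python
def quantize_channelwise(W, nbits=8):
    """Symmetric per-output-channel quantization: one scale per row of W.
    Illustrative sketch only."""
    qmax = (1 << (nbits - 1)) - 1            # e.g. 127 for int8
    Wq, scales = [], []
    for row in W:
        s = max(abs(x) for x in row) / qmax or 1.0  # avoid a zero scale
        scales.append(s)
        Wq.append([round(x / s) for x in row])
    return Wq, scales

def dequantize(Wq, scales):
    """Recover approximate fp weights from codes and per-channel scales."""
    return [[q * s for q in row] for row, s in zip(Wq, scales)]
```

Because each channel uses its own dynamic range, a single outlier channel no longer inflates the scale for every other channel, which is what makes INT8/FP8 weight formats usable in practice.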
v0.2.1

18 Oct 20:14
29980e2
  • GEMM Split-K support
  • torch.compile() support
  • Tunable loading order for the A (activation) matrix, eviction policies, and atomic-add mode
  • Overall performance improvement
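Split-K divides the reduction (K) dimension among several independent workers whose partial sums are then combined, on the GPU typically via atomic adds; this keeps all SMs busy when the output is small (e.g. batch-size 1). A pure-Python sketch of the idea for a vector-matrix product; the function name and shapes are illustrative, not the kernel itself:

```python
def gemv_splitk(a, B, split_k=4):
    """Split-K vector-matrix product: a is length K, B is K x N.
    Each of the split_k chunks reduces its own K-slice; partial results
    are accumulated into the shared output (atomic adds on a GPU)."""
    K, N = len(a), len(B[0])
    chunk = (K + split_k - 1) // split_k
    out = [0.0] * N
    for s in range(split_k):                 # each chunk maps to one program/block
        lo, hi = s * chunk, min((s + 1) * chunk, K)
        for n in range(N):
            out[n] += sum(a[k] * B[k][n] for k in range(lo, hi))
    return out
```

The trade-off is extra accumulation traffic (and, with atomics, non-deterministic summation order) in exchange for more parallelism along K.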

v0.1.0

19 Sep 11:24
4690460

Triton Kernels

  • A16W8 (GEMV + GEMM) - with grouping
  • A16W4 (GEMV + GEMM) - with grouping
  • A16W2 (GEMV + GEMM) - with grouping
  • A16W1 (GEMV + GEMM) - with grouping

CUDA Kernels

  • A16W8 (GEMV - batch-size=1) - no grouping
  • A16W4 (GEMV - batch-size=1) - no grouping
  • A16W2 (GEMV - batch-size=1) - no grouping
  • A8W8 (GEMV - batch-size=1) - no grouping
  • A8W4 (GEMV - batch-size=1) - no grouping
  • A8W2 (GEMV - batch-size=1) - no grouping
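In the AxWy naming used above, x is the activation bit-width and y the weight bit-width, so A16W4 multiplies fp16 activations by 4-bit weights; "with grouping" means the weight codes are dequantized with a per-group scale and zero-point along K. A minimal sketch of that dequantize-then-multiply pattern, assuming hypothetical names and a tiny group size; the real kernels fuse this into the matmul:

```python
def a16w4_matvec(x, Wq, scales, zeros, group_size=2):
    """A16W4 with grouping, sketched: x is a length-K activation vector,
    Wq a K x N matrix of 4-bit codes, scales/zeros indexed per K-group."""
    K, N = len(Wq), len(Wq[0])
    out = [0.0] * N
    for n in range(N):
        for k in range(K):
            g = k // group_size                          # group index along K
            w = (Wq[k][n] - zeros[g][n]) * scales[g][n]  # dequantize the 4-bit code
            out[n] += x[k] * w
    return out
```

The "no grouping" CUDA kernels above use a single scale/zero-point per channel instead, which is cheaper but less accurate at 2 and 4 bits.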