
Releases: mobiusml/gemlite

v0.4.1

10 Dec 14:11

Fix bugs related to config caching.

v0.4.0

05 Dec 09:51
8908e50
  • Improved performance on the A100 and H100.
  • Flexible bitpacking support (32-bit / 8-bit, over cols or rows).
  • Best-config caching across all kernels.
  • Helper functions for easier usage.
  • GEMV_SPLITK kernel for better performance at batch-size=1 with non-packed data.
  • Improved accuracy for 8-bit weights with GEMV kernels.
  • Max-autotuning.
  • Avoid out-of-shared-memory by limiting num_stages based on the GPU device.
  • Various bug fixes.

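The flexible bitpacking feature stores several low-bit weight codes inside one wider word (32-bit or 8-bit), laid out along either the rows or the columns of the weight matrix. A minimal pure-Python sketch of the idea for 4-bit values packed into 32-bit words; the function names and layout are illustrative assumptions, not gemlite's actual packing API:

```python
def pack_words(vals, nbits=4, word_bits=32):
    """Pack low-bit integer codes into wider words (illustrative layout:
    value j occupies bits [j*nbits, (j+1)*nbits) of its word)."""
    per_word = word_bits // nbits            # e.g. 8 four-bit values per 32-bit word
    assert len(vals) % per_word == 0
    mask = (1 << nbits) - 1
    words = []
    for i in range(0, len(vals), per_word):
        w = 0
        for j, v in enumerate(vals[i:i + per_word]):
            w |= (v & mask) << (j * nbits)   # place each code in its bit slot
        words.append(w)
    return words

def unpack_words(words, nbits=4, word_bits=32):
    """Invert pack_words: recover the original low-bit codes."""
    per_word = word_bits // nbits
    mask = (1 << nbits) - 1
    return [(w >> (j * nbits)) & mask for w in words for j in range(per_word)]
```

Packing "over cols or rows" then simply decides whether consecutive codes in a word come from the same row or the same column of the weight matrix, which changes the memory-access pattern inside the kernel.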
v0.3.0

28 Oct 16:05
  • New GEMV RevSplitK algorithm outperforms GEMM Split-K and GEMV for batch-size=1
  • Add support for channel-wise scaling (weights, activations, weights + activations)
  • Add support for FP8 x FP8 / FP8 x Wn
  • Add support for INT8 x Wn
  • Improved autotune speed
  • Improved base configs for 4090 RTX, A100 and H100
  • Better control for autotune via set_autotune

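Channel-wise scaling assigns each output channel its own quantization scale instead of one scale for the whole tensor. A minimal sketch of symmetric per-channel weight quantization, under the assumption of one scale per row; gemlite's scheme is more general (weights, activations, or both):

```python
def quantize_channelwise(W, nbits=8):
    """Symmetric per-output-channel quantization: one scale per row of W.
    Illustrative sketch only."""
    qmax = (1 << (nbits - 1)) - 1            # e.g. 127 for int8
    Wq, scales = [], []
    for row in W:
        s = max(abs(x) for x in row) / qmax or 1.0  # avoid a zero scale
        scales.append(s)
        Wq.append([round(x / s) for x in row])
    return Wq, scales

def dequantize(Wq, scales):
    """Recover approximate fp weights from codes and per-channel scales."""
    return [[q * s for q in row] for row, s in zip(Wq, scales)]
```

Because each channel uses its own dynamic range, a single outlier channel no longer inflates the scale for every other channel, which is what makes INT8/FP8 weight formats usable in practice.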
v0.2.1

18 Oct 20:14
29980e2
  • GEMM Split-K support
  • torch.compile() support
  • Tunable loading order for the A (activation) matrix, eviction policies, and atomic-add mode
  • Overall performance improvement
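Split-K divides the reduction (K) dimension among several independent workers whose partial sums are then combined, on the GPU typically via atomic adds; this keeps all SMs busy when the output is small (e.g. batch-size 1). A pure-Python sketch of the idea for a vector-matrix product; the function name and shapes are illustrative, not the kernel itself:

```python
def gemv_splitk(a, B, split_k=4):
    """Split-K vector-matrix product: a is length K, B is K x N.
    Each of the split_k chunks reduces its own K-slice; partial results
    are accumulated into the shared output (atomic adds on a GPU)."""
    K, N = len(a), len(B[0])
    chunk = (K + split_k - 1) // split_k
    out = [0.0] * N
    for s in range(split_k):                 # each chunk maps to one program/block
        lo, hi = s * chunk, min((s + 1) * chunk, K)
        for n in range(N):
            out[n] += sum(a[k] * B[k][n] for k in range(lo, hi))
    return out
```

The trade-off is extra accumulation traffic (and, with atomics, non-deterministic summation order) in exchange for more parallelism along K.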

v0.1.0

19 Sep 11:24
4690460

Triton Kernels

  • A16W8 (GEMV + GEMM) - with grouping
  • A16W4 (GEMV + GEMM) - with grouping
  • A16W2 (GEMV + GEMM) - with grouping
  • A16W1 (GEMV + GEMM) - with grouping

CUDA Kernels

  • A16W8 (GEMV - batch-size=1) - no grouping
  • A16W4 (GEMV - batch-size=1) - no grouping
  • A16W2 (GEMV - batch-size=1) - no grouping
  • A8W8 (GEMV - batch-size=1) - no grouping
  • A8W4 (GEMV - batch-size=1) - no grouping
  • A8W2 (GEMV - batch-size=1) - no grouping
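In the AxWy naming used above, x is the activation bit-width and y the weight bit-width, so A16W4 multiplies fp16 activations by 4-bit weights; "with grouping" means the weight codes are dequantized with a per-group scale and zero-point along K. A minimal sketch of that dequantize-then-multiply pattern, assuming hypothetical names and a tiny group size; the real kernels fuse this into the matmul:

```python
def a16w4_matvec(x, Wq, scales, zeros, group_size=2):
    """A16W4 with grouping, sketched: x is a length-K activation vector,
    Wq a K x N matrix of 4-bit codes, scales/zeros indexed per K-group."""
    K, N = len(Wq), len(Wq[0])
    out = [0.0] * N
    for n in range(N):
        for k in range(K):
            g = k // group_size                          # group index along K
            w = (Wq[k][n] - zeros[g][n]) * scales[g][n]  # dequantize the 4-bit code
            out[n] += x[k] * w
    return out
```

The "no grouping" CUDA kernels above use a single scale/zero-point per channel instead, which is cheaper but less accurate at 2 and 4 bits.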