diff --git a/README.md b/README.md
index 0b07eaf..16f65cf 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,14 @@
-GemLite is a collection of straightforward CUDA and Triton kernels for efficient, fused low-bit matrix multiplication. It is specifically designed for simplicity and reasubility.
+# GemLite
+GemLite is a collection of straightforward CUDA and Triton kernels for efficient, fused low-bit matrix multiplication. It is specifically designed for simplicity and reasubility.
 This project was initiated because we found it challenging to customize the low-bit kernels that are currently available. GemLite provides both flexibility and performance, enabling users to easily modify the codebase to develop high-performance kernels tailored to their specific needs.
-While GemLite can outperform the best existing implementations on large matrices, there's still potential for further optimization!
+While GemLite can outperform the best existing implementations on large matrices, there's still potential for further optimization!
- 8bit_gs=infeatures_32768x32768_4090RTX
+ 8bit_gs=infeatures_32768x32768_4090RTX
@@ -15,14 +16,14 @@ While GemLite can outperform the best existing implementations on large matrices
- 4bit_gs=128_32768x32768_4090RTX
+ 4bit_gs=128_32768x32768_4090RTX
- 2bit_gs=128_32768x32768_4090RTX
+ 2bit_gs=128_32768x32768_4090RTX
@@ -83,28 +84,28 @@ We present performance results across various batch sizes on the RTX 4096. Perfo
 8-bit Weights
- 8bit_gs=infeatures_4096x4096_4090RTX
+ 8bit_gs=infeatures_4096x4096_4090RTX
- 8bit_gs=infeatures_8192x8192_4090RTX
+ 8bit_gs=infeatures_8192x8192_4090RTX
- 8bit_gs=infeatures_16384x16384_4090RTX
+ 8bit_gs=infeatures_16384x16384_4090RTX
- 8bit_gs=infeatures_32768x32768_4090RTX
+ 8bit_gs=infeatures_32768x32768_4090RTX
@@ -115,28 +116,28 @@ We present performance results across various batch sizes on the RTX 4096. Perfo
 4-bit Weights
- 4bit_gs=128_4096x4096_4090RTX
+ 4bit_gs=128_4096x4096_4090RTX
- 4bit_gs=128_8192x8192_4090RTX
+ 4bit_gs=128_8192x8192_4090RTX
- 4bit_gs=128_16384x16384_4090RTX
+ 4bit_gs=128_16384x16384_4090RTX
- 4bit_gs=128_32768x32768_4090RTX
+ 4bit_gs=128_32768x32768_4090RTX
@@ -146,28 +147,28 @@ We present performance results across various batch sizes on the RTX 4096. Perfo
 2-bit Weights
- 2bit_gs=128_4096x4096_4090RTX
+ 2bit_gs=128_4096x4096_4090RTX
- 2bit_gs=128_8192x8192_4090RTX
+ 2bit_gs=128_8192x8192_4090RTX
- 2bit_gs=128_16384x16384_4090RTX
+ 2bit_gs=128_16384x16384_4090RTX
- 2bit_gs=128_32768x32768_4090RTX
+ 2bit_gs=128_32768x32768_4090RTX
@@ -244,7 +245,7 @@ Although the kernels are designed for general purposes, they perform well in pra
 3090
- 4090
+ 4090