doc fix
charlifu committed Jun 1, 2024
1 parent 2b5c5d1 commit 977361b
Showing 2 changed files with 4 additions and 4 deletions.
6 changes: 3 additions & 3 deletions ROCm_performance.md
@@ -21,9 +21,9 @@ The custom PagedAttention kernel is enabled for dtype: fp16, block-size=16, head

## Fp8 Quantization

- To use fp8 quantization, first step is to use Nvidia ammo to quantize your model to fp8 format, following this [instruction](https://github.com/vllm-project/vllm/blob/main/examples/fp8/quantizer/README.md). This will give a safetensor file that contains the quantized weights and the corresponding scaling factors of your model. We will need to put the safetensor file under your model folder, and add file called `serenity_config.json`, which contains a json object with a key: `"quantized_weights": "quantized/osf/rank0.safetensors"`, the value should be the releative path of your safetensor file containing the quantized weights.
+ To use fp8 quantization, the first step is to quantize your model to fp8 format, producing a safetensor file that contains the quantized weights and the corresponding scaling factors of your model. The safetensor file should be placed under your model folder along with a file called `serenity_config.json`, which contains a json object with the key `"quantized_weights": "quantized/osf/rank0.safetensors"`; the value should be the relative path of the safetensor file containing the quantized weights.

Then we can run a model with fp8 quantization using vllm, just add a parameter `quantization="fp8"` when creating the vllm.LLM object.
Then we can run a model with fp8 quantization using vllm, just add a parameter `quantization="fp8"` when creating the `vllm.LLM` object.

## Gemm Tuning for Fp8

@@ -37,7 +37,7 @@ Next, run gradlib to obtain the best solutions of these shapes:
cd gradlib_fp8
python3 -m pip uninstall gradlib
python3 setup.py install
- python3 gemm_tunner.py --input_file /fp8_shapes.csv --tuned_file /tuned_fp8_16.csv
+ python3 gemm_tuner.py --input_file /fp8_shapes.csv --tuned_file /tuned_fp8_16.csv
cd ../gradlib
python3 -m pip uninstall gradlib
python3 setup.py install
2 changes: 1 addition & 1 deletion vllm/model_executor/layers/quantization/fp8_rocm.py
@@ -59,7 +59,7 @@ def get_min_capability(cls) -> int:

@classmethod
def get_name(cls) -> str:
return "serenity"
return "Fp8Rocm"

def get_linear_method(self) -> "Fp8RocmLinearLayer":
return Fp8RocmLinearLayer(self)
