diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index 8aaccfd124..1cb1d32c0f 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -451,6 +451,31 @@ QUANT_CONFIG=./quantization_config/maxabs_quant_phi.json python run_generation.p
 --reuse_cache
 ```
 
+Here is an example to measure the tensor quantization statistics on Gemma with 1 card:
+
+```bash
+QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
+--model_name_or_path google/gemma-7b \
+--use_hpu_graphs \
+--use_kv_cache \
+--max_new_tokens 100 \
+--batch_size 1 \
+--reuse_cache \
+--bf16
+```
+
+Here is an example to quantize the model based on previous measurements for Gemma with 1 card:
+```bash
+QUANT_CONFIG=./quantization_config/maxabs_quant_gemma.json python run_generation.py \
+--model_name_or_path google/gemma-7b \
+--use_hpu_graphs \
+--use_kv_cache \
+--max_new_tokens 100 \
+--batch_size 1 \
+--reuse_cache \
+--bf16
+```
+
 
 ### Running FP8 models on single device
 
diff --git a/examples/text-generation/quantization_config/maxabs_quant_gemma.json b/examples/text-generation/quantization_config/maxabs_quant_gemma.json
new file mode 100644
index 0000000000..e7c6b6ddd2
--- /dev/null
+++ b/examples/text-generation/quantization_config/maxabs_quant_gemma.json
@@ -0,0 +1,12 @@
+{
+    "method": "HOOKS",
+    "mode": "QUANTIZE",
+    "observer": "maxabs",
+    "scale_method": "maxabs_hw",
+    "blocklist": {"types": [], "names": [
+        "matmul_qk",
+        "matmul_av",
+        "lm_head"
+    ]},
+    "dump_stats_path": "./hqt_output/measure"
+}
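
Note: the measurement command above references `quantization_config/maxabs_measure.json`, which already exists in the example folder and is not part of this diff. As a minimal sketch, assuming it follows the same HQT config schema as the new `maxabs_quant_gemma.json` with the mode switched to measurement (and no `scale_method`, since scales are only computed at quantization time), it would look something like this:

```json
{
    "method": "HOOKS",
    "mode": "MEASURE",
    "observer": "maxabs",
    "dump_stats_path": "./hqt_output/measure"
}
```

The two configs are coupled through `dump_stats_path`: the measurement run writes maxabs statistics there, and the quantization run reads them back to derive the FP8 scales. The `blocklist` in `maxabs_quant_gemma.json` excludes `matmul_qk`, `matmul_av`, and `lm_head` from FP8 conversion, presumably so the attention score and attention-value matmuls plus the output projection stay in higher precision (bf16), where quantization error tends to hurt generation quality most.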