The provided bash scripts show how to deploy and run the quantized Llama 3.1 8B Instruct FP8 model from NVIDIA's Hugging Face model hub on TensorRT-LLM and vLLM, respectively.
Before running the bash scripts, please make sure you have set up the environment properly:
- Download the ModelOpt quantized checkpoints. You can either download all checkpoints with this script, or use
  ```bash
  huggingface-cli download <HF repo> --local-dir <local_dir>
  ```
  to download a specific one (see the example after this list).
- Git clone the TensorRT-LLM repo and install it by following the instructions here.
- Install vLLM by following the instructions here.
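For example, to fetch a single checkpoint (the repo ID below is an assumption; substitute whichever ModelOpt checkpoint you need from NVIDIA's Hugging Face model hub):

```bash
# Download one FP8 checkpoint into a local directory.
# The repo ID is an assumption; replace it with the checkpoint you want.
huggingface-cli download nvidia/Llama-3.1-8B-Instruct-FP8 --local-dir ./Llama-3.1-8B-Instruct-FP8
```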
Then, to deploy and run on TensorRT-LLM:
```bash
bash llama_fp8_deploy_trtllm.sh <YOUR_HF_CKPT_DIR> <YOUR_TensorRT_LLM_DIR>
```
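For instance, with the directories prepared in the setup steps above (both paths are placeholders for your local layout):

```bash
# Placeholder paths: the downloaded checkpoint and the cloned TensorRT-LLM repo.
bash llama_fp8_deploy_trtllm.sh ./Llama-3.1-8B-Instruct-FP8 ./TensorRT-LLM
```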
To deploy and run on vLLM:
```bash
bash llama_fp8_deploy_vllm.sh <YOUR_HF_CKPT_DIR>
```
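Under the hood, serving a ModelOpt FP8 checkpoint with vLLM boils down to something like the following — a minimal sketch assuming a vLLM build with ModelOpt quantization support; the script's actual invocation may differ:

```bash
# Minimal sketch: serve the FP8 checkpoint with vLLM's OpenAI-compatible server.
# Assumes vLLM's ModelOpt quantization backend; flags may vary by version.
vllm serve ./Llama-3.1-8B-Instruct-FP8 --quantization modelopt
```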
If you want to run post-training quantization (PTQ) with TensorRT Model Optimizer on models of your choice, check here.
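As a rough sketch, the Model Optimizer repo's llm_ptq example quantizes a Hugging Face checkpoint with a command along these lines; the script name and flags here are assumptions based on the repo's examples and may differ across versions, so follow the linked instructions for the exact interface:

```bash
# Hypothetical PTQ invocation via TensorRT Model Optimizer's llm_ptq example.
# Script path and flags are assumptions; consult the linked docs for specifics.
python hf_ptq.py --pyt_ckpt_path <HF_MODEL_DIR> --qformat fp8 --export_path <QUANTIZED_CKPT_DIR>
```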