# Deploy quantized models from NVIDIA's Hugging Face model hub with TensorRT-LLM and vLLM

The provided bash scripts show how to deploy and run the quantized Llama 3.1 8B Instruct FP8 model from NVIDIA's Hugging Face model hub on TensorRT-LLM and vLLM, respectively.

Before running the bash scripts, make sure you have set up the environment properly:

- Download the ModelOpt quantized checkpoints. You can either download all checkpoints with this script, or use `huggingface-cli download <HF repo> --local-dir <local_dir>` to download a specific one (see the example after this list).
- Git clone the TensorRT-LLM repo and install it by following the instructions here.
- Install vLLM by following the instructions here.
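
For example, downloading a single checkpoint looks roughly like the following (the repo ID below is illustrative; substitute the quantized checkpoint you actually want):

```bash
# Download one quantized checkpoint from NVIDIA's Hugging Face model hub.
# The repo ID is an example; replace it with the checkpoint you need.
huggingface-cli download nvidia/Llama-3.1-8B-Instruct-FP8 \
    --local-dir ./Llama-3.1-8B-Instruct-FP8
```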

Then, to deploy and run on TensorRT-LLM:

```bash
bash llama_fp8_deploy_trtllm.sh <YOUR_HF_CKPT_DIR> <YOUR_TensorRT_LLM_DIR>
```
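
For instance, assuming the checkpoint and the TensorRT-LLM clone both live in the current directory (both paths are illustrative):

```bash
# Example invocation with illustrative local paths.
bash llama_fp8_deploy_trtllm.sh ./Llama-3.1-8B-Instruct-FP8 ./TensorRT-LLM
```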

To deploy and run on vLLM:

```bash
bash llama_fp8_deploy_vllm.sh <YOUR_HF_CKPT_DIR>
```
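
If you prefer to serve the downloaded checkpoint directly rather than through the script, recent vLLM builds can load ModelOpt FP8 checkpoints; a minimal sketch, assuming your vLLM install provides the OpenAI-compatible server and the `modelopt` quantization backend:

```bash
# Hypothetical direct invocation, not part of the provided scripts:
# serve the FP8 checkpoint with vLLM's OpenAI-compatible server.
vllm serve <YOUR_HF_CKPT_DIR> --quantization modelopt

# Smoke-test from another shell once the server is up.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<YOUR_HF_CKPT_DIR>", "prompt": "Hello", "max_tokens": 16}'
```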

To run post-training quantization with TensorRT Model Optimizer on models of your choice, check here.