# Deploy quantized models from NVIDIA's Hugging Face model hub with TensorRT-LLM and vLLM

The provided bash scripts show how to deploy and run the quantized Llama 3.1 8B Instruct FP8 model from NVIDIA's Hugging Face model hub on TensorRT-LLM and vLLM, respectively.

Before running the bash scripts, make sure you have set up the environment properly:

- Download the ModelOpt quantized checkpoints. You can either download all checkpoints with this script, or use `huggingface-cli download <HF repo> --local-dir <local_dir>` to download a specific one (see the example after this list).
- Git clone the TensorRT-LLM repo and install it by following the instructions here.
- Install vLLM by following the instructions here.
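
For example, downloading a single checkpoint looks roughly like the following (the repo ID below is illustrative; substitute the quantized checkpoint you actually want):

```bash
# Download one quantized checkpoint from NVIDIA's Hugging Face model hub.
# The repo ID is an example; replace it with the checkpoint you need.
huggingface-cli download nvidia/Llama-3.1-8B-Instruct-FP8 \
    --local-dir ./Llama-3.1-8B-Instruct-FP8
```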

Then, to deploy and run on TensorRT-LLM:

```bash
bash llama_fp8_deploy_trtllm.sh <YOUR_HF_CKPT_DIR> <YOUR_TensorRT_LLM_DIR>
```
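
For instance, assuming the checkpoint and the TensorRT-LLM clone both live in the current directory (both paths are illustrative):

```bash
# Example invocation with illustrative local paths.
bash llama_fp8_deploy_trtllm.sh ./Llama-3.1-8B-Instruct-FP8 ./TensorRT-LLM
```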

To deploy and run on vLLM:

```bash
bash llama_fp8_deploy_vllm.sh <YOUR_HF_CKPT_DIR>
```
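
If you prefer to serve the downloaded checkpoint directly rather than through the script, recent vLLM builds can load ModelOpt FP8 checkpoints; a minimal sketch, assuming your vLLM install provides the OpenAI-compatible server and the `modelopt` quantization backend:

```bash
# Hypothetical direct invocation, not part of the provided scripts:
# serve the FP8 checkpoint with vLLM's OpenAI-compatible server.
vllm serve <YOUR_HF_CKPT_DIR> --quantization modelopt

# Smoke-test from another shell once the server is up.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<YOUR_HF_CKPT_DIR>", "prompt": "Hello", "max_tokens": 16}'
```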

To run post-training quantization with TensorRT Model Optimizer on models of your choice, check here.