
🔧 Serving Open-Sourced Models

We use VLLM and FastChat to serve open-source models, and we provide nvidia-docker images for all of the serving setups.

Please refer to the documentation of VLLM and FastChat if you do not wish to use docker for serving.

Prepare your model

Download the model you want to evaluate from Huggingface.

cd $DIR_TO_SAVE_MODELS
git lfs install
git clone [email protected]:<MODEL ID> # example: git clone [email protected]:meta-llama/Llama-2-13b-chat-hf

VLLM

We use VLLM to serve all of the open-source LLMs we evaluated, except CodeLLaMA, which hits a bug when served with VLLM and is served with FastChat instead (see below).

Please modify scripts/serve/run_vllm_serve.sh to serve a Huggingface-compatible model. You can adjust the following settings in the script:

  • MODEL_DIR (required): set this to your $DIR_TO_SAVE_MODELS.
  • MODEL_NAME (required): the name of your model (e.g., Llama-2-13b-chat-hf if you ran git clone [email protected]:meta-llama/Llama-2-13b-chat-hf in the previous step).
  • N_GPUS (required): the number of GPUs to use for tensor parallelism.
  • CUDA_VISIBLE_DEVICES (required): the GPU IDs to use, separated by commas (e.g., 0,1,2,3).
  • PORT (optional): the port on which to serve your LLM.

After you set all of the above correctly, run scripts/serve/run_vllm_serve.sh to spin up a server with an OpenAI-compatible API, available at http://localhost:$PORT.
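
For example, the settings above might be filled in as follows (the values are placeholders; adapt them to your own paths and hardware, and keep whatever variable layout the script itself uses):

# example values inside scripts/serve/run_vllm_serve.sh (placeholders)
MODEL_DIR=$DIR_TO_SAVE_MODELS        # directory where you cloned the model
MODEL_NAME=Llama-2-13b-chat-hf       # directory name produced by git clone
N_GPUS=2                             # tensor-parallel size
CUDA_VISIBLE_DEVICES=0,1             # GPUs to use, comma-separated
PORT=8000                            # port for the OpenAI-compatible API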

You can test whether the server started successfully by running:

curl http://localhost:$PORT/v1/models
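
If the model is listed, you can also send a quick test request to the OpenAI-compatible completions endpoint (the model name below is only an example; use the name returned by /v1/models):

curl http://localhost:$PORT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Llama-2-13b-chat-hf", "prompt": "Hello, my name is", "max_tokens": 16}'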

NOTE: for the lemur-chat model, we renamed the cloned directory with mv lemur-70b-chat-v1 llama-2-lemur-70b-chat-v1, since VLLM and FastChat use the directory name to match the corresponding chat template.

FastChat

We serve all the CodeLLaMA models using scripts/serve/run_fastchat_serve.sh. The variables that need to be configured are the same as for VLLM above.
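
Assuming the FastChat script exposes the same OpenAI-compatible API on $PORT (check the settings in the script), you can verify it with the same request as for VLLM:

curl http://localhost:$PORT/v1/models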

FAQ

What if the GPU server (server-A) that I use for model serving has no Internet access for the API-based feedback provider? I have another machine without a GPU that does have Internet access (server-B).

First, make sure that you can ssh from server-A to server-B.
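
A quick way to verify this (server-B here stands for whatever hostname or ~/.ssh/config alias you use to reach it):

# on server-A
ssh server-B 'echo ssh connection OK'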

Then you can use ssh's built-in port-forwarding functionality to expose the LLM port on server-B by running:

# on server-A: reverse-forward the serving port so that server-B can reach the LLM at localhost:$PORT
ssh -N -R 0.0.0.0:$PORT:localhost:$PORT server-B

# If you installed autossh (recommended), which reconnects automatically if the connection drops, replace the above command with:
autossh -M 0 -N -R 0.0.0.0:$PORT:localhost:$PORT server-B

Then, from server-B, you should be able to access the LLM served by VLLM or FastChat:

# on server-B
curl http://localhost:$PORT/v1/models
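
If your terminal session may drop, one option (not part of the provided scripts) is to keep the tunnel running in the background with nohup:

# on server-A
nohup autossh -M 0 -N -R 0.0.0.0:$PORT:localhost:$PORT server-B > autossh.log 2>&1 &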