Create distributed.md #1438

Merged on Jan 6, 2025 (15 commits)
17 changes: 17 additions & 0 deletions .ci/scripts/run-docs
@@ -125,3 +125,20 @@ if [ "$1" == "native" ]; then
bash -x ./run-native.sh
echo "::endgroup::"
fi

if [ "$1" == "distributed" ]; then

echo "::group::Create script to run distributed"
python3 torchchat/utils/scripts/updown.py --file docs/distributed.md > ./run-distributed.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./run-distributed.sh
echo "::endgroup::"

echo "::group::Run distributed"
echo "*******************************************"
cat ./run-distributed.sh
echo "*******************************************"
bash -x ./run-distributed.sh
echo "::endgroup::"
fi
116 changes: 116 additions & 0 deletions docs/distributed.md
@@ -0,0 +1,116 @@
# Distributed Inference with torchchat

torchchat supports distributed inference for large language models (LLMs) on GPUs.
At present, torchchat supports distributed inference using Python only.

## Installation
The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.

> [!TIP]
> torchchat uses the latest changes from various PyTorch projects, so we highly recommend using a venv (created by the commands below) or conda.

[skip default]: begin
```bash
git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install/install_requirements.sh
```
[skip default]: end

[shell default]: ./install/install_requirements.sh

## Enabling Distributed torchchat Inference

To enable distributed inference, use the `--distributed` option. In addition, `--tp <num>` and `--pp <num>`
set the degree of tensor parallelism and pipeline parallelism, respectively.
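
The examples in this document pair `--tp 2 --pp 2` with 4 GPUs. The sketch below assumes the required GPU count is the product of the two values, which matches those examples but is not stated as a rule here. The helper names `required_gpus` and `build_command` are hypothetical and not part of torchchat; they only assemble the command line shown in this document.

[skip default]: begin
```python
import shlex


def required_gpus(tp: int, pp: int) -> int:
    """GPUs needed, ASSUMING one GPU per (tp, pp) rank combination."""
    if tp < 1 or pp < 1:
        raise ValueError("tp and pp must be positive integers")
    return tp * pp


def build_command(subcommand: str, model: str, tp: int, pp: int) -> str:
    """Assemble a torchchat command line using only flags shown in this document."""
    args = [
        "python3", "torchchat.py", subcommand, model,
        "--distributed", "--tp", str(tp), "--pp", str(pp),
    ]
    # shlex.join quotes arguments safely for a POSIX shell.
    return shlex.join(args)


if __name__ == "__main__":
    print(required_gpus(2, 2))
    print(build_command("chat", "llama3.1", 2, 2))
```
[skip default]: end

A check like this can be run before launching, so that a mismatch between the requested parallelism and the available GPUs is caught early.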

<!--
[skip default]: begin
## Generate output (requires testing and review by mreso)

To generate output using distributed inference with 4 GPUs, you can use:
```
python3 torchchat.py generate llama3.1 --distributed --tp 2 --pp 2 --prompt "write me a story about a boy and his bear"
```
[skip default]: end
-->

## Chat with Distributed torchchat Inference

This mode lets you chat interactively with an LLM using distributed inference. The following example uses 4 GPUs:

[skip default]: begin
```bash
python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2
```
[skip default]: end


## A Server with Distributed torchchat Inference

This mode exposes a REST API for interacting with a model.
The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions.

To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request.

In one terminal, start the server on 4 GPUs:

[skip default]: begin

```bash
python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2
```
[skip default]: end

<!--
[shell default]: python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 & server_pid=$! ; sleep 180 # wait for server to be ready to accept requests
-->
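
A freshly started server may need some time before it accepts requests. As a minimal sketch (not part of torchchat), the hypothetical helper `wait_for_port` below polls the TCP port used by the query example in this document (`127.0.0.1:5000`) rather than sleeping for a fixed interval. The timeout and polling interval are illustrative values. A successful connection only shows that the port is open; it does not guarantee that the model has finished loading.

[skip default]: begin
```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 180.0,
                  interval_s: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused) on failure.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```
[skip default]: end

For example, `wait_for_port("127.0.0.1", 5000)` could be called in the second terminal before sending the first request.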

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See `api/api.py` for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking GitHub issue](https://github.com/pytorch/torchchat/issues/973).

<details>
<summary>Example Query</summary>

Setting `stream` to "true" in the request makes the server emit the response in chunks. If `stream` is unset or not "true", the client waits for the full response from the server.

**Example Input + Output**

```
curl http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"stream": "true",
"max_tokens": 200,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
```
[skip default]: begin
```
{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"}
```

[skip default]: end

<!--
[shell default]: kill ${server_pid}
-->

</details>
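
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server is reachable at the address used in the `curl` example. `build_payload` and `send_chat` are hypothetical helper names, not part of torchchat. The exact format of streamed chunks is not specified in this document, so the sketch prints each non-empty line of the response as-is instead of parsing it.

[skip default]: begin
```python
import json
import urllib.request

# Same endpoint as the curl example; adjust if the server runs elsewhere.
SERVER_URL = "http://127.0.0.1:5000/v1/chat/completions"


def build_payload(user_message: str, stream: bool = True,
                  max_tokens: int = 200) -> dict:
    """Build a chat completion request body mirroring the curl example."""
    payload = {
        "model": "llama3.1",
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    }
    if stream:
        # The curl example passes the string "true", so it is mirrored here.
        payload["stream"] = "true"
    return payload


def send_chat(payload: dict, url: str = SERVER_URL,
              timeout: float = 600.0) -> None:
    """POST the payload and print each non-empty response line unparsed."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw_line in response:
            line = raw_line.decode("utf-8").strip()
            if line:
                print(line)
```
[skip default]: end

With the server running, `send_chat(build_payload("Hello!"))` sends the request. The generous timeout reflects that a query might take a few minutes to respond.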

[end default]: end