From 3be38cc38fffb8ae690da03c8679d8c85b47afac Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Sun, 22 Dec 2024 14:57:54 -0800 Subject: [PATCH 01/11] Create distributed.md Initial documentation for use of distributed inference w/ torchchat. @mreso please review and update as appropriate. --- docs/distributed.md | 113 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 113 insertions(+) create mode 100644 docs/distributed.md diff --git a/docs/distributed.md b/docs/distributed.md new file mode 100644 index 000000000..b5a2f73d1 --- /dev/null +++ b/docs/distributed.md @@ -0,0 +1,113 @@ +# Distributed Inference with torchchat + +torchchat suports distributed inference for large language models (LLMs) on GPUs seamlessly. +At present, torchchat supports distributed inference using Python only. + +## Installation +The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed. + +> [!TIP] +> torchchat uses the latest changes from various PyTorch projects so it's highly recommended that you use a venv (by using the commands below) or CONDA. + +[skip default]: begin +```bash +git clone https://github.com/pytorch/torchchat.git +cd torchchat +python3 -m venv .venv +source .venv/bin/activate +./install/install_requirements.sh +``` +[skip default]: end + +[shell default]: ./install/install_requirements.sh + +##Enabling Distributed torchchat Inference + +To enable distributed inference, use the option `--distributed`. In addition, `--tp ` and `--pp ` +allow users to specify the types of parallelism to use. + + + +## CHat with Distributed torchchat Inference + +### Chat +This mode allows you to chat with an LLM in an interactive fashion with distributed Inference. The following example uses 4 GPUs: + +[skip default]: begin +```bash +python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2 +``` +[skip default]: end + + +## A Server with Distributed torchchat Inference + +This mode exposes a REST API for interacting with a model. +The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions. + +To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request. + +In one terminal, start the server to run with 4 GPUs: + +[skip default]: begin + +```bash +python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 +``` +[skip default]: end + +[shell default]: python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 & server_pid=$! ; sleep 180 # wait for server to be ready to accept requests + +In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond. + +> [!NOTE] +> Since this feature is under active development, not every parameter is consumed. See api/api.py for details on +> which request parameters are implemented. If you encounter any issues, please comment on the [tracking Github issue](https://github.com/pytorch/torchchat/issues/973). + +
+Example Query + +Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will await the full response from the server. + +**Example Input + Output** + +``` +curl http://127.0.0.1:5000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "llama3.1", + "stream": "true", + "max_tokens": 200, + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Hello!" + } + ] + }' +``` +[skip default]: begin +``` +{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"} +``` + +[skip default]: end + +[shell default]: kill ${server_pid} + +
+ +[end default]: end From 8e2ca2de4f2be86bd74eca7cea1232e2fb8e6062 Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Sun, 22 Dec 2024 15:03:20 -0800 Subject: [PATCH 02/11] Add support for extracting distributed inference tests in run-docs Add support for extracting distributed inference tests in run-docs --- .ci/scripts/run-docs | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/.ci/scripts/run-docs b/.ci/scripts/run-docs index 6f5ee46c7..521cfa811 100755 --- a/.ci/scripts/run-docs +++ b/.ci/scripts/run-docs @@ -125,3 +125,20 @@ if [ "$1" == "native" ]; then bash -x ./run-native.sh echo "::endgroup::" fi + +if [ "$1" == "distributed" ]; then + + echo "::group::Create script to run distributed" + python3 torchchat/utils/scripts/updown.py --file docs/distributed.md > ./run-distributed.sh + # for good measure, if something happened to updown processor, + # and it did not error out, fail with an exit 1 + echo "exit 1" >> ./run-distributed.sh + echo "::endgroup::" + + echo "::group::Run distributed" + echo "*******************************************" + cat ./run-distributed.sh + echo "*******************************************" + bash -x ./run-distributed.sh + echo "::endgroup::" +fi From 66dd0252141ea50a5f5107b200083d645bd9fee4 Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Sun, 22 Dec 2024 23:56:43 -0800 Subject: [PATCH 03/11] Update distributed.md --- docs/distributed.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/distributed.md b/docs/distributed.md index b5a2f73d1..28a691c50 100644 --- a/docs/distributed.md +++ b/docs/distributed.md @@ -65,7 +65,9 @@ python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 ``` [skip default]: end + In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond. From 8f4b3120ff6c86cdf28ac628340f70220a285357 Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Mon, 23 Dec 2024 00:00:24 -0800 Subject: [PATCH 04/11] Update distributed.md --- docs/distributed.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/distributed.md b/docs/distributed.md index 28a691c50..ef9efe055 100644 --- a/docs/distributed.md +++ b/docs/distributed.md @@ -37,9 +37,8 @@ python3 torchchat.py generate llama3.1 --distributed --tp 2 --pp 2 --prompt " [skip default]: end --> -## CHat with Distributed torchchat Inference +## Chat with Distributed torchchat Inference -### Chat This mode allows you to chat with an LLM in an interactive fashion with distributed Inference. The following example uses 4 GPUs: [skip default]: begin @@ -108,7 +107,9 @@ curl http://127.0.0.1:5000/v1/chat/completions \ [skip default]: end + From f3ff0143537e9458ae194cfc34b873b28bbe6f4f Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Mon, 23 Dec 2024 00:01:39 -0800 Subject: [PATCH 05/11] Update distributed.md --- docs/distributed.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/distributed.md b/docs/distributed.md index ef9efe055..6766f2205 100644 --- a/docs/distributed.md +++ b/docs/distributed.md @@ -21,7 +21,7 @@ source .venv/bin/activate [shell default]: ./install/install_requirements.sh -##Enabling Distributed torchchat Inference +## Enabling Distributed torchchat Inference To enable distributed inference, use the option `--distributed`. 
In addition, `--tp ` and `--pp ` allow users to specify the types of parallelism to use. From 3af0f525fc2e0f7d776946432b89421d0987c58b Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Tue, 24 Dec 2024 03:06:55 -0800 Subject: [PATCH 06/11] Update docs/distributed.md Co-authored-by: Matthias Reso <13337103+mreso@users.noreply.github.com> --- docs/distributed.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/distributed.md b/docs/distributed.md index 6766f2205..da25cca39 100644 --- a/docs/distributed.md +++ b/docs/distributed.md @@ -1,6 +1,6 @@ # Distributed Inference with torchchat -torchchat suports distributed inference for large language models (LLMs) on GPUs seamlessly. +torchchat supports distributed inference for large language models (LLMs) on GPUs seamlessly. At present, torchchat supports distributed inference using Python only. ## Installation From b65f0e4baf4727ced50f441a5ac01276b52c0dc7 Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Tue, 24 Dec 2024 03:07:22 -0800 Subject: [PATCH 07/11] Update docs/distributed.md Co-authored-by: Matthias Reso <13337103+mreso@users.noreply.github.com> --- docs/distributed.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/distributed.md b/docs/distributed.md index da25cca39..1c42cea31 100644 --- a/docs/distributed.md +++ b/docs/distributed.md @@ -24,7 +24,7 @@ source .venv/bin/activate ## Enabling Distributed torchchat Inference To enable distributed inference, use the option `--distributed`. In addition, `--tp ` and `--pp ` -allow users to specify the types of parallelism to use. +allow users to specify the types of parallelism to use (where tp refers to tensor parallelism and pp to pipeline parallelism). + ## Chat with Distributed torchchat Inference @@ -43,7 +52,7 @@ This mode allows you to chat with an LLM in an interactive fashion with distribu [skip default]: begin ```bash -python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2 +python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2 ``` [skip default]: end From 1bb3303c7c3a5af94c7931a6262e3221edfd689f Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Tue, 24 Dec 2024 03:14:52 -0800 Subject: [PATCH 09/11] Update distributed.md Wording --- docs/distributed.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/distributed.md b/docs/distributed.md index c5a572f80..9d0989353 100644 --- a/docs/distributed.md +++ b/docs/distributed.md @@ -21,7 +21,7 @@ source .venv/bin/activate [shell default]: ./install/install_requirements.sh -## Download Weights +## Login to HF for Downloading Weights Most models use Hugging Face as the distribution channel, so you will need to create a Hugging Face account. Create a Hugging Face user access token as documented here with the write role. 
Log into Hugging Face: From 17e2764bb159545c7f4edcc8c41a957d81b6164b Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Tue, 24 Dec 2024 03:16:53 -0800 Subject: [PATCH 10/11] Update distributed.md Wording and formatting --- docs/distributed.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/distributed.md b/docs/distributed.md index 9d0989353..3d34d7672 100644 --- a/docs/distributed.md +++ b/docs/distributed.md @@ -35,7 +35,7 @@ huggingface-cli login ## Enabling Distributed torchchat Inference To enable distributed inference, use the option `--distributed`. In addition, `--tp ` and `--pp ` -allow users to specify the types of parallelism to use (where tp refers to tensor parallelism and pp to pipeline parallelism). +allow users to specify the types of parallelism to use where tp refers to tensor parallelism and pp to pipeline parallelism. ## Generate Output with Distributed torchchat Inference @@ -52,7 +52,7 @@ This mode allows you to chat with an LLM in an interactive fashion with distribu [skip default]: begin ```bash -python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2 +python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2 ``` [skip default]: end @@ -69,7 +69,7 @@ In one terminal, start the server to run with 4 GPUs: [skip default]: begin ```bash -python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 +python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 ``` [skip default]: end From b1c7e5ac5e73f992b22d03e96241791af99579e0 Mon Sep 17 00:00:00 2001 From: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Date: Sat, 28 Dec 2024 23:12:37 -0800 Subject: [PATCH 11/11] Update build_native.sh Update to C++11 ABI for AOTI, similar to ET --- torchchat/utils/scripts/build_native.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/torchchat/utils/scripts/build_native.sh b/torchchat/utils/scripts/build_native.sh index 3c2c1c846..a935fa74c 100755 --- a/torchchat/utils/scripts/build_native.sh +++ b/torchchat/utils/scripts/build_native.sh @@ -93,7 +93,7 @@ popd if [[ "$TARGET" == "et" ]]; then cmake -S . -B ./cmake-out -DCMAKE_PREFIX_PATH=`python3 -c 'import torch;print(torch.utils.cmake_prefix_path)'` -DLINK_TORCHAO_OPS="${LINK_TORCHAO_OPS}" -DET_USE_ADAPTIVE_THREADS=ON -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=1" -G Ninja else - cmake -S . -B ./cmake-out -DCMAKE_PREFIX_PATH=`python3 -c 'import torch;print(torch.utils.cmake_prefix_path)'` -DLINK_TORCHAO_OPS="${LINK_TORCHAO_OPS}" -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" -G Ninja + cmake -S . -B ./cmake-out -DCMAKE_PREFIX_PATH=`python3 -c 'import torch;print(torch.utils.cmake_prefix_path)'` -DLINK_TORCHAO_OPS="${LINK_TORCHAO_OPS}" -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=1" -G Ninja fi cmake --build ./cmake-out --target "${TARGET}"_run
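
A note on the final patch: the `-D_GLIBCXX_USE_CXX11_ABI` value passed to cmake in build_native.sh has to agree with the ABI that the installed libtorch was built with; a mismatch typically shows up as undefined `std::__cxx11::basic_string` symbols when linking the runner. Below is a minimal way to check the local install before building, assuming a PyTorch wheel is already present in the active environment:

```bash
# Prints 1 if the installed PyTorch/libtorch was built with the C++11 ABI, 0 otherwise.
# This value should match the -D_GLIBCXX_USE_CXX11_ABI flag used in build_native.sh.
python3 -c 'import torch; print(int(torch.compiled_with_cxx11_abi()))'
```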