From 9b3e12dbc820a8b72bcd7d712a8c9a54facf97df Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E7=9F=B3=E6=99=93=E4=BC=9F?= <39303645+Shixiaowei02@users.noreply.github.com> Date: Mon, 4 Dec 2023 18:59:41 +0800 Subject: [PATCH] Update TensorRT-LLM (#546) --- docs/source/2023-05-19-how-to-debug.md | 2 +- docs/source/blogs/H100vsA100.md | 6 ++-- docs/source/gpt_runtime.md | 2 +- docs/source/index.rst | 13 ++++++++ docs/source/installation.md | 12 ++++---- docs/source/memory.md | 4 +-- docs/source/performance.md | 20 +++++++----- examples/gpt/README.md | 42 ++++++++++++++++++++++++++ 8 files changed, 80 insertions(+), 21 deletions(-) diff --git a/docs/source/2023-05-19-how-to-debug.md b/docs/source/2023-05-19-how-to-debug.md index 7eca250e1..858d23994 100644 --- a/docs/source/2023-05-19-how-to-debug.md +++ b/docs/source/2023-05-19-how-to-debug.md @@ -58,7 +58,7 @@ print(outputs.keys()) print(outputs['inter']) ``` -Here is the [full example](../../tests/test_debugging_api.py). +Here is the [full example](source:tests/test_debugging_api.py). ## Debug on E2E models diff --git a/docs/source/blogs/H100vsA100.md b/docs/source/blogs/H100vsA100.md index c331a1402..86112f4e8 100644 --- a/docs/source/blogs/H100vsA100.md +++ b/docs/source/blogs/H100vsA100.md @@ -2,7 +2,7 @@ -# H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token. +# H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token TensorRT-LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 is able to achieve over 10,000 output tok/s at [peak throughput](https://nvidia.github.io/TensorRT-LLM/performance.html#h100-gpus-fp8) for 64 concurrent requests, while maintaining a 1st token latency of 100ms. For [min-latency](https://nvidia.github.io/TensorRT-LLM/performance.html#id1) applications, TRT-LLM H100 can achieve less than 10ms to 1st token latency. @@ -32,10 +32,10 @@ The full data behind these charts & tables and including larger models with high Stay tuned for a highlight on Llama coming soon! -#### MLPerf on H100 with FP8 +## MLPerf on H100 with FP8 In the most recent MLPerf results, NVIDIA demonstrated up to 4.5x speedup in model inference performance on the NVIDIA H100 compared to previous results on the NVIDIA A100 Tensor Core GPU. Using the same data types, the H100 showed a 2x increase over the A100. Switching to FP8 resulted in yet another 2x increase in speed. -#### What is H100 FP8? +## What is H100 FP8? H100 is NVIDIA's next-generation, highest-performing data center GPU. Based on the NVIDIA Hopper GPU architecture, H100 accelerates AI training and inference, HPC, and data analytics applications in cloud data centers, servers, systems at the edge, and workstations. Providing native support for FP8 data types H100 can double performance and halve memory consumption, compared to 16-bit floating point options on H100. FP8 specification introduced in the paper [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433) can be used to speed up training as well as inference with post-training-quantization of models trained using 16-bit formats. The specification consists of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). The recommended use of FP8 encodings is E4M3 for weight and activation tensors, and E5M2 for gradient tensors. 
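The FP8 description above names the two encodings from the cited "FP8 Formats for Deep Learning" paper, E4M3 and E5M2, and the recommended split between weight/activation tensors and gradient tensors. A short, self-contained sketch can make the range-versus-precision trade-off concrete. The bit layouts and bias values below follow that paper; the helper function itself is purely illustrative and is not part of TensorRT-LLM.

```python
# Illustrative only (not TensorRT-LLM code): works out the dynamic range of the two
# FP8 encodings described in "FP8 Formats for Deep Learning" (arXiv:2209.05433).

def fp8_range(exp_bits: int, man_bits: int, bias: int, ieee_special_values: bool):
    """Return (max_finite, min_normal, min_subnormal) for a sign+exponent+mantissa format."""
    if ieee_special_values:
        # E5M2 keeps the IEEE convention: the all-ones exponent is reserved for inf/NaN,
        # so the largest usable exponent is (2**exp_bits - 2) - bias.
        max_exp = (2**exp_bits - 2) - bias
        max_mantissa = 2 - 2**(-man_bits)
    else:
        # E4M3 reclaims the all-ones exponent for normal values and keeps only the
        # mantissa=all-ones pattern as NaN, so the top exponent is (2**exp_bits - 1) - bias
        # and the largest mantissa is one step below all-ones.
        max_exp = (2**exp_bits - 1) - bias
        max_mantissa = 2 - 2 * 2**(-man_bits)
    min_normal = 2.0**(1 - bias)
    min_subnormal = 2.0**(1 - bias - man_bits)
    return max_mantissa * 2.0**max_exp, min_normal, min_subnormal

for name, cfg in {"E4M3": (4, 3, 7, False), "E5M2": (5, 2, 15, True)}.items():
    print(name, fp8_range(*cfg))
# E4M3 -> (448.0, 0.015625, 0.001953125)
# E5M2 -> (57344.0, 6.103515625e-05, 1.52587890625e-05)
```

E4M3 tops out at 448 but offers finer steps, which is why it suits weights and activations, while E5M2 reaches 57344 and better covers the wider magnitude range of gradients.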
diff --git a/docs/source/gpt_runtime.md b/docs/source/gpt_runtime.md index eeea339c7..cd3ff1edc 100644 --- a/docs/source/gpt_runtime.md +++ b/docs/source/gpt_runtime.md @@ -173,7 +173,7 @@ MPI_Init(&argc, &argv); // Get the number of ranks (size of the world). int worldSize; MPI_Comm_size(MPI_COMM_WORLD, &worldSize); - + // Get the unique identifier for each rank. int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank); diff --git a/docs/source/index.rst b/docs/source/index.rst index 6d15d7cea..a8cff9d3f 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -15,6 +15,7 @@ Welcome to TensorRT-LLM's documentation! batch_manager.md gpt_attention.md precision.md + installation.md performance.md 2023-05-19-how-to-debug.md 2023-05-17-how-to-add-a-new-model.md @@ -65,3 +66,15 @@ Indices and tables * :ref:`genindex` * :ref:`modindex` * :ref:`search` + + +Blogs +---------- + +.. toctree:: + :maxdepth: 2 + :caption: Blogs + :hidden: + + blogs/H100vsA100.md + blogs/H200launch.md diff --git a/docs/source/installation.md b/docs/source/installation.md index 1cb608084..6cbc185a7 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -1,4 +1,4 @@ -# Table of Contents +# Build TensorRT-LLM - [Overview](#overview) - [Fetch the Sources](#fetch-the-sources) @@ -153,8 +153,8 @@ The list of supported architectures can be found in the ### Build the Python Bindings for the C++ Runtime -The C++ Runtime, in particular, [`GptSession`](../../cpp/include/tensorrt_llm/runtime/gptSession.h) can be exposed to -Python via [bindings](../../cpp/tensorrt_llm/pybind/bindings.cpp). This is currently an opt-in feature which needs to be +The C++ Runtime, in particular, [`GptSession`](source:cpp/include/tensorrt_llm/runtime/gptSession.h) can be exposed to +Python via [bindings](source:cpp/tensorrt_llm/pybind/bindings.cpp). This is currently an opt-in feature which needs to be explicitly activated during compilation time. The corresponding option `--python_bindings` can be specified to `build_wheel.py` in the standard way: @@ -164,7 +164,7 @@ python3 ./scripts/build_wheel.py --python_bindings --trt_root /usr/local/tensorr After installing the resulting wheel as described above, the C++ Runtime bindings will be available in package `tensorrt_llm.bindings`. Running `help` on this package in a Python interpreter will provide on overview of the -relevant classes. The [associated unit tests](../../tests/bindings) should also be consulted for understanding the API. +relevant classes. The [associated unit tests](source:tests/bindings) should also be consulted for understanding the API. ### Link with the TensorRT-LLM C++ Runtime @@ -209,5 +209,5 @@ headers contained under `cpp` should not be included directly since they might change in future versions. For examples of how to use the C++ runtime, see the unit tests in -[gptSessionTest.cpp](cpp/tests/runtime/gptSessionTest.cpp) and the related -[CMakeLists.txt](cpp/tests/CMakeLists.txt) file. +[gptSessionTest.cpp](source:cpp/tests/runtime/gptSessionTest.cpp) and the related +[CMakeLists.txt](source:cpp/tests/CMakeLists.txt) file. 
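The installation guide above states that, after building and installing a wheel with `--python_bindings`, the C++ runtime bindings are available in the `tensorrt_llm.bindings` package and that running `help` on it gives an overview of the relevant classes. A minimal sketch of that inspection step (assuming the wheel was built and installed exactly as described):

```python
# Quick sanity check after installing a wheel built with `--python_bindings`.
# This only inspects the package; the exact class names vary with the TensorRT-LLM version.
import tensorrt_llm.bindings as bindings

# List the classes and functions exposed from the C++ runtime.
print([name for name in dir(bindings) if not name.startswith("_")])

# Full docstrings, as suggested in the installation guide.
help(bindings)
```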
diff --git a/docs/source/memory.md b/docs/source/memory.md index 069dfc40f..c846b192b 100644 --- a/docs/source/memory.md +++ b/docs/source/memory.md @@ -75,8 +75,8 @@ The Python runtime allocates KV cache tensors based on the parameters of the `Ge ## Memory pool -TensorRT-LLM C++ runtime is using stream-ordered memory allocator to allocate and free buffers, see [BufferManager::initMemoryPool](cpp/tensorrt_llm/runtime/bufferManager.cpp), which uses the default memory pool managed by the CUDA driver. When a `GptSession` object is destroyed, memory is returned to the memory pool and can be reused by the next instance of a `GptSession` object. Memory will be released from the pool if it is required for other memory allocations. -However, `nvidia-smi` may still show high memory occupation after memory is returned to the CUDA driver's memory pool. This should not be a concern and is intended behavior. The amount of reserved and free memory in the pool can be inspected by [BufferManager::memoryPoolReserved())](cpp/tensorrt_llm/runtime/bufferManager.cpp) and [BufferManager::memoryPoolFree())](cpp/tensorrt_llm/runtime/bufferManager.cpp), respectively. +TensorRT-LLM C++ runtime is using stream-ordered memory allocator to allocate and free buffers, see [BufferManager::initMemoryPool](source:cpp/tensorrt_llm/runtime/bufferManager.cpp), which uses the default memory pool managed by the CUDA driver. When a `GptSession` object is destroyed, memory is returned to the memory pool and can be reused by the next instance of a `GptSession` object. Memory will be released from the pool if it is required for other memory allocations. +However, `nvidia-smi` may still show high memory occupation after memory is returned to the CUDA driver's memory pool. This should not be a concern and is intended behavior. The amount of reserved and free memory in the pool can be inspected by [BufferManager::memoryPoolReserved())](source:cpp/tensorrt_llm/runtime/bufferManager.cpp) and [BufferManager::memoryPoolFree())](source:cpp/tensorrt_llm/runtime/bufferManager.cpp), respectively. ## Known Issues diff --git a/docs/source/performance.md b/docs/source/performance.md index 0e4d0e843..30be549e5 100644 --- a/docs/source/performance.md +++ b/docs/source/performance.md @@ -10,7 +10,7 @@ performance that can be delivered by TensorRT-LLM. ## Methodology The different performance numbers below were collected using the methodology -described in the benchmarks [folder](../../benchmarks/). +described in the benchmarks [folder](source:benchmarks/). ## High Throughput @@ -145,6 +145,7 @@ include a more efficient implementation that runs single Matmul + SwiGLU fused k ## Reproducing Benchmarked Results ### Building the TensorRT-LLM Container + --- In order to benchmark TensorRT-LLM, you will need to follow the [Quick Start](../../README.md#quick-start) build process to create a baseline container for building a wheel. Additionally, the development @@ -231,7 +232,8 @@ in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). 
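memory.md above notes that the runtime sizes its KV-cache tensors from the session parameters, and that freed buffers stay in the CUDA driver's memory pool, which is why `nvidia-smi` can keep reporting high usage after a session is destroyed. For capacity planning it helps to estimate the cache up front. The sketch below uses the standard 2 × layers × heads × head-dim × sequence × batch accounting; it is not a TensorRT-LLM API, and the model shape, dtype, and batch settings are example values to replace with your own.

```python
# Rough, model-agnostic estimate of the KV-cache footprint discussed in memory.md.
# Not a TensorRT-LLM API; all numbers below are placeholders.

def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V tensor per layer, per head, per sequence position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Example: a 6B-class decoder (28 layers, 16 heads of size 256), FP16 cache,
# batch 64 with 128 input + 2048 output tokens.
gib = kv_cache_bytes(batch_size=64, seq_len=128 + 2048,
                     num_layers=28, num_kv_heads=16, head_dim=256) / 2**30
print(f"~{gib:.1f} GiB of KV cache on top of the engine weights")
# An 8-bit KV cache (dtype_bytes=1) would halve this figure.
```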
## Benchmarking per Model -#### GPT-J 6B +### GPT-J 6B + --- ```shell python examples/gptj/build.py \ @@ -255,7 +257,7 @@ python examples/gptj/build.py \ --enable_two_optimization_profiles ``` -##### Throughput Benchmark +#### Throughput Benchmark ```shell in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "64:2048,2048") @@ -269,7 +271,7 @@ do done ``` -##### First Token Latency Benchmark +#### First Token Latency Benchmark ```shell in_out_sizes=("64:128,1" "64:2048,1") @@ -285,6 +287,7 @@ done ### Llama2-7b + --- ```shell pip install -r examples/llama/requirements.txt @@ -313,7 +316,7 @@ python examples/llama/build.py \ --hidden_act silu ``` -##### Throughput Benchmark +#### Throughput Benchmark ```shell in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "32:2048,2048") @@ -326,7 +329,7 @@ do ./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir /tmp/engines/llama/7b --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims done ``` -##### First Token Latency Benchmark +#### First Token Latency Benchmark ```shell in_out_sizes=("64:128,1" "32:2048,1") @@ -372,7 +375,7 @@ python examples/llama/build.py \ --multiple_of 4096 ``` -##### Throughput Benchmark +#### Throughput Benchmark ```shell in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "64:2048,2048") @@ -386,7 +389,7 @@ do done ``` -##### First Token Latency Benchmark +#### First Token Latency Benchmark ```shell in_out_sizes=("64:128,1" "64:128,1") @@ -402,6 +405,7 @@ done ### Falcon-180B + --- Benchmarking Falcon-180B requires a custom engine per batch size, input/output sequence length due diff --git a/examples/gpt/README.md b/examples/gpt/README.md index 1fda50d22..3e519488a 100644 --- a/examples/gpt/README.md +++ b/examples/gpt/README.md @@ -535,3 +535,45 @@ python3 build.py --model_dir=./c-model/gpt2/2-gpu --dtype bfloat16 --world_size= mpirun -np 2 python3 ../summarize.py --engine_dir trt_engine/gpt2/bfloat16/2-gpu --hf_model_dir gpt2 --batch_size 10 --test_trt_llm --check_accuracy --tensorrt_llm_rouge1_threshold=14 --dataset_path ./dataset --no_add_special_tokens ``` + +### Run LoRA with the Nemo checkpoint + +```bash +git clone https://huggingface.co/nvidia/GPT-2B-001 +python3 nemo_ckpt_convert.py -i GPT-2B-001/GPT-2B-001_bf16_tp1.nemo -o /tmp/c-model/gpt-next-2B --tensor-parallelism 1 --storage-type bfloat16 + +python3 build.py --model_dir=/tmp/c-model/gpt-next-2B/1-gpu/ \ + --dtype bfloat16 \ + --remove_input_padding \ + --use_gpt_attention_plugin \ + --output_dir /tmp/gpt-next-2B/ \ + --use_lora_plugin \ + --max_batch_size 4 \ + --max_input_len 512 \ + --max_output_len 50 \ + --lora_target_modules "attn_qkv" + +python3 nemo_lora_convert.py -i tmp_nemo_ckpt/gpt2b_lora-900.nemo -o /tmp/gpt-next-2B/ -t bf16 # Assume lora weights are in tmp_nemo_ckpt/gpt2b_lora-900.nemo + +python3 ../run.py --max_output_len=20 \ + --vocab_file=/tmp/c-model/gpt-next-2B/1-gpu/tokenizer.model \ + --engine_dir /tmp/gpt-next-2B/ \ + --lora_dir /tmp/gpt-next-2B/ \ + --lora_task_uids "lora" \ + --no_add_special_tokens \ + --input_text "After Washington had returned to Williamsburg, Dinwiddie ordered him to lead a larger force to assist Trent in his work. While en route, Washington learned of Trent's retreat. Since Tanaghrisson had promised support to the British, Washington continued toward Fort Duquesne and met with the Mingo leader. 
Learning of a French scouting party in the area, Washington, with Tanaghrisson and his party, surprised the Canadians on May 28 in what became known as the Battle of Jumonville Glen. They killed many of the Canadians, including their commanding officer, Joseph Coulon de Jumonville, whose head was reportedly split open by Tanaghrisson with a tomahawk. The historian Fred Anderson suggests that Tanaghrisson was acting to gain the support of the British and regain authority over his own people. They had been inclined to support the French, with whom they had long trading relationships. One of Tanaghrisson's men told Contrecoeur that Jumonville had been killed by British musket fire. Question: Upon learning of a French scounting party in the area, what did Washington do? Answer:" +``` + +Users who want to skip LoRA module may pass uid -1 with `--lora_task_uids -1`. +In that case, the model will not run the LoRA module and the results will be +different. + +```bash +python3 ../run.py --max_output_len=20 \ + --vocab_file=/tmp/c-model/gpt-next-2B/1-gpu/tokenizer.model \ + --engine_dir /tmp/gpt-next-2B/ \ + --lora_dir /tmp/gpt-next-2B/ \ + --lora_task_uids "-1" \ + --no_add_special_tokens \ + --input_text "After Washington had returned to Williamsburg, Dinwiddie ordered him to lead a larger force to assist Trent in his work. While en route, Washington learned of Trent's retreat. Since Tanaghrisson had promised support to the British, Washington continued toward Fort Duquesne and met with the Mingo leader. Learning of a French scouting party in the area, Washington, with Tanaghrisson and his party, surprised the Canadians on May 28 in what became known as the Battle of Jumonville Glen. They killed many of the Canadians, including their commanding officer, Joseph Coulon de Jumonville, whose head was reportedly split open by Tanaghrisson with a tomahawk. The historian Fred Anderson suggests that Tanaghrisson was acting to gain the support of the British and regain authority over his own people. They had been inclined to support the French, with whom they had long trading relationships. One of Tanaghrisson's men told Contrecoeur that Jumonville had been killed by British musket fire. Question: Upon learning of a French scounting party in the area, what did Washington do? Answer:" +```
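The two `run.py` invocations in this section differ only in `--lora_task_uids` ("lora" versus "-1"). To verify that a LoRA checkpoint actually changes the generation, it can be convenient to run both variants from one place and compare the outputs. The harness below is only a convenience sketch, not part of the examples: it shells out to the same command shown above, and it assumes it is launched from `examples/gpt/` (so that `../run.py` resolves) with the prompt saved to a local `prompt.txt`.

```python
# Convenience sketch (not part of TensorRT-LLM): run the example above with and
# without the LoRA module and print both generations for comparison.
import subprocess

COMMON = [
    "python3", "../run.py", "--max_output_len=20",
    "--vocab_file=/tmp/c-model/gpt-next-2B/1-gpu/tokenizer.model",
    "--engine_dir", "/tmp/gpt-next-2B/",
    "--lora_dir", "/tmp/gpt-next-2B/",
    "--no_add_special_tokens",
]

# Assumes the question prompt from the example has been saved to prompt.txt.
prompt = open("prompt.txt").read().strip()

for uid in ("lora", "-1"):  # "-1" skips the LoRA module, as noted above
    result = subprocess.run(COMMON + ["--lora_task_uids", uid, "--input_text", prompt],
                            capture_output=True, text=True)
    print(f"--- lora_task_uids={uid} ---")
    print(result.stdout.strip())
```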
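The benchmark loops in performance.md earlier in this change encode each case as `batch_size:input_len,output_len` (for example `"64:128,2048"`) before calling `gptSessionBenchmark`. For anyone driving the sweep from Python rather than bash, a minimal equivalent of the Llama2-7b throughput loop might look like the sketch below; the binary path, flags, and engine directory are taken verbatim from that loop, while the wrapper itself is an assumption.

```python
# Python equivalent of the bash sweep in performance.md: each entry is
# "batch_size:input_len,output_len", exactly as in the in_out_sizes arrays above.
import subprocess

in_out_sizes = ["64:128,128", "64:128,2048", "64:2048,128", "32:2048,2048"]

for entry in in_out_sizes:
    batch_size, in_out_dims = entry.split(":")  # e.g. "64", "128,2048"
    cmd = [
        "./cpp/build/benchmarks/gptSessionBenchmark",
        "--model", "llama",
        "--engine_dir", "/tmp/engines/llama/7b",
        "--warm_up", "1",
        "--batch_size", batch_size,
        "--duration", "0",
        "--num_runs", "5",
        "--input_output_len", in_out_dims,
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```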