Update TensorRT-LLM (#546)
Shixiaowei02 authored Dec 4, 2023
1 parent 8dd9c91 commit 9b3e12d
Showing 8 changed files with 80 additions and 21 deletions.
2 changes: 1 addition & 1 deletion docs/source/2023-05-19-how-to-debug.md
@@ -58,7 +58,7 @@ print(outputs.keys())
print(outputs['inter'])
```

Here is the [full example](../../tests/test_debugging_api.py).
Here is the [full example](source:tests/test_debugging_api.py).


## Debug on E2E models
6 changes: 3 additions & 3 deletions docs/source/blogs/H100vsA100.md
@@ -2,7 +2,7 @@


# H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token.
# H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token

TensorRT-LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 is able to achieve over 10,000 output tok/s at [peak throughput](https://nvidia.github.io/TensorRT-LLM/performance.html#h100-gpus-fp8) for 64 concurrent requests, while maintaining a 1st token latency of 100ms. For [min-latency](https://nvidia.github.io/TensorRT-LLM/performance.html#id1) applications, TRT-LLM H100 can achieve less than 10ms to 1st token latency.

@@ -32,10 +32,10 @@ The full data behind these charts & tables and including larger models with high

Stay tuned for a highlight on Llama coming soon!

#### MLPerf on H100 with FP8
## MLPerf on H100 with FP8
In the most recent MLPerf results, NVIDIA demonstrated up to 4.5x speedup in model inference performance on the NVIDIA H100 compared to previous results on the NVIDIA A100 Tensor Core GPU. Using the same data types, the H100 showed a 2x increase over the A100. Switching to FP8 resulted in yet another 2x increase in speed.

#### What is H100 FP8?
## What is H100 FP8?
H100 is NVIDIA's next-generation, highest-performing data center GPU. Based on the NVIDIA Hopper GPU architecture, H100 accelerates AI training and inference, HPC, and data analytics applications in cloud data centers, servers, systems at the edge, and workstations. Providing native support for FP8 data types H100 can double performance and halve memory consumption, compared to 16-bit floating point options on H100.

FP8 specification introduced in the paper [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433) can be used to speed up training as well as inference with post-training-quantization of models trained using 16-bit formats. The specification consists of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). The recommended use of FP8 encodings is E4M3 for weight and activation tensors, and E5M2 for gradient tensors.
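
As a rough illustration of how the two encodings trade range for precision, the sketch below derives the largest finite value and the smallest subnormal value of each format from its bit layout. The helper name `fp8_range` and its `ieee_specials` flag are hypothetical and used only for this example; the sketch assumes, following the FP8 paper, that E5M2 reserves its top exponent for infinities and NaNs while E4M3 reserves only the all-ones bit pattern for NaN.

```python
def fp8_range(exp_bits, man_bits, ieee_specials):
    """Return (max_finite, min_subnormal) for a small floating-point format.

    ieee_specials=True: the top exponent is reserved for inf/NaN (E5M2).
    ieee_specials=False: only the all-ones mantissa at the top exponent
    is NaN, so the top exponent still holds finite values (E4M3).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_specials:
        max_exp = (2 ** exp_bits - 2) - bias
        max_man = 2 ** man_bits - 1
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_man = 2 ** man_bits - 2
    max_finite = (1 + max_man / 2 ** man_bits) * 2.0 ** max_exp
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    return max_finite, min_subnormal


for name, args in {"E4M3": (4, 3, False), "E5M2": (5, 2, True)}.items():
    max_finite, min_subnormal = fp8_range(*args)
    print(f"{name}: max finite = {max_finite}, min subnormal = {min_subnormal}")
```

E4M3 keeps one more mantissa bit, which suits weights and activations, while E5M2 covers a much wider dynamic range, which suits gradients.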
2 changes: 1 addition & 1 deletion docs/source/gpt_runtime.md
@@ -173,7 +173,7 @@ MPI_Init(&argc, &argv);
// Get the number of ranks (size of the world).
int worldSize;
MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

// Get the unique identifier for each rank.
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
13 changes: 13 additions & 0 deletions docs/source/index.rst
@@ -15,6 +15,7 @@ Welcome to TensorRT-LLM's documentation!
batch_manager.md
gpt_attention.md
precision.md
installation.md
performance.md
2023-05-19-how-to-debug.md
2023-05-17-how-to-add-a-new-model.md
@@ -65,3 +66,15 @@ Indices and tables
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


Blogs
----------

.. toctree::
   :maxdepth: 2
   :caption: Blogs
   :hidden:

   blogs/H100vsA100.md
   blogs/H200launch.md
12 changes: 6 additions & 6 deletions docs/source/installation.md
@@ -1,4 +1,4 @@
# Table of Contents
# Build TensorRT-LLM

- [Overview](#overview)
- [Fetch the Sources](#fetch-the-sources)
@@ -153,8 +153,8 @@ The list of supported architectures can be found in the

### Build the Python Bindings for the C++ Runtime

The C++ Runtime, in particular, [`GptSession`](../../cpp/include/tensorrt_llm/runtime/gptSession.h) can be exposed to
Python via [bindings](../../cpp/tensorrt_llm/pybind/bindings.cpp). This is currently an opt-in feature which needs to be
The C++ Runtime, in particular, [`GptSession`](source:cpp/include/tensorrt_llm/runtime/gptSession.h) can be exposed to
Python via [bindings](source:cpp/tensorrt_llm/pybind/bindings.cpp). This is currently an opt-in feature which needs to be
explicitly activated during compilation time. The corresponding option `--python_bindings` can be specified
to `build_wheel.py` in the standard way:

@@ -164,7 +164,7 @@ python3 ./scripts/build_wheel.py --python_bindings --trt_root /usr/local/tensorr

After installing the resulting wheel as described above, the C++ Runtime bindings will be available in
package `tensorrt_llm.bindings`. Running `help` on this package in a Python interpreter will provide an overview of the
relevant classes. The [associated unit tests](../../tests/bindings) should also be consulted for understanding the API.
relevant classes. The [associated unit tests](source:tests/bindings) should also be consulted for understanding the API.
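
A minimal sketch of that inspection step, assuming the wheel was built with `--python_bindings` and installed. The helper `load_bindings` is hypothetical and exists only to give a readable message when the package is missing; the only TensorRT-LLM name used is the `tensorrt_llm.bindings` package mentioned above.

```python
import importlib


def load_bindings(name="tensorrt_llm.bindings"):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


bindings = load_bindings()
if bindings is None:
    print("tensorrt_llm.bindings is not available; rebuild with --python_bindings")
else:
    # Prints the overview of the classes exposed by the C++ runtime bindings.
    help(bindings)
```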

### Link with the TensorRT-LLM C++ Runtime

@@ -209,5 +209,5 @@ headers contained under `cpp` should not be included directly since they might
change in future versions.

For examples of how to use the C++ runtime, see the unit tests in
[gptSessionTest.cpp](cpp/tests/runtime/gptSessionTest.cpp) and the related
[CMakeLists.txt](cpp/tests/CMakeLists.txt) file.
[gptSessionTest.cpp](source:cpp/tests/runtime/gptSessionTest.cpp) and the related
[CMakeLists.txt](source:cpp/tests/CMakeLists.txt) file.
4 changes: 2 additions & 2 deletions docs/source/memory.md
@@ -75,8 +75,8 @@ The Python runtime allocates KV cache tensors based on the parameters of the `Ge

## Memory pool

TensorRT-LLM C++ runtime is using stream-ordered memory allocator to allocate and free buffers, see [BufferManager::initMemoryPool](cpp/tensorrt_llm/runtime/bufferManager.cpp), which uses the default memory pool managed by the CUDA driver. When a `GptSession` object is destroyed, memory is returned to the memory pool and can be reused by the next instance of a `GptSession` object. Memory will be released from the pool if it is required for other memory allocations.
However, `nvidia-smi` may still show high memory occupation after memory is returned to the CUDA driver's memory pool. This should not be a concern and is intended behavior. The amount of reserved and free memory in the pool can be inspected by [BufferManager::memoryPoolReserved())](cpp/tensorrt_llm/runtime/bufferManager.cpp) and [BufferManager::memoryPoolFree())](cpp/tensorrt_llm/runtime/bufferManager.cpp), respectively.
TensorRT-LLM C++ runtime is using stream-ordered memory allocator to allocate and free buffers, see [BufferManager::initMemoryPool](source:cpp/tensorrt_llm/runtime/bufferManager.cpp), which uses the default memory pool managed by the CUDA driver. When a `GptSession` object is destroyed, memory is returned to the memory pool and can be reused by the next instance of a `GptSession` object. Memory will be released from the pool if it is required for other memory allocations.
However, `nvidia-smi` may still show high memory occupation after memory is returned to the CUDA driver's memory pool. This should not be a concern and is intended behavior. The amount of reserved and free memory in the pool can be inspected by [BufferManager::memoryPoolReserved())](source:cpp/tensorrt_llm/runtime/bufferManager.cpp) and [BufferManager::memoryPoolFree())](source:cpp/tensorrt_llm/runtime/bufferManager.cpp), respectively.

## Known Issues

20 changes: 12 additions & 8 deletions docs/source/performance.md
@@ -10,7 +10,7 @@ performance that can be delivered by TensorRT-LLM.
## Methodology

The different performance numbers below were collected using the methodology
described in the benchmarks [folder](../../benchmarks/).
described in the benchmarks [folder](source:benchmarks/).

## High Throughput

@@ -145,6 +145,7 @@ include a more efficient implementation that runs single Matmul + SwiGLU fused k
## Reproducing Benchmarked Results

### Building the TensorRT-LLM Container

---
In order to benchmark TensorRT-LLM, you will need to follow the [Quick Start](../../README.md#quick-start)
build process to create a baseline container for building a wheel. Additionally, the development
@@ -231,7 +232,8 @@ in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).

## Benchmarking per Model

#### GPT-J 6B
### GPT-J 6B

---
```shell
python examples/gptj/build.py \
@@ -255,7 +257,7 @@ python examples/gptj/build.py \
--enable_two_optimization_profiles
```

##### Throughput Benchmark
#### Throughput Benchmark

```shell
in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "64:2048,2048")
@@ -269,7 +271,7 @@ do
done
```

##### First Token Latency Benchmark
#### First Token Latency Benchmark

```shell
in_out_sizes=("64:128,1" "64:2048,1")
@@ -285,6 +287,7 @@ done


### Llama2-7b

---
```shell
pip install -r examples/llama/requirements.txt
@@ -313,7 +316,7 @@ python examples/llama/build.py \
--hidden_act silu
```

##### Throughput Benchmark
#### Throughput Benchmark

```shell
in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "32:2048,2048")
@@ -326,7 +329,7 @@ do
./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir /tmp/engines/llama/7b --warm_up 1 --batch_size $batch_size --duration 0 --num_runs 5 --input_output_len $in_out_dims
done
```
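
The entries in `in_out_sizes` pack three values into one string. The sketch below shows one way to unpack them, assuming the format is `batch_size:input_len,output_len`, as the `--batch_size` and `--input_output_len` flags in the command above suggest. The helper `parse_benchmark_spec` is hypothetical and not part of the benchmark scripts.

```python
def parse_benchmark_spec(spec):
    """Split 'batch_size:input_len,output_len' into three integers."""
    batch_size, in_out_dims = spec.split(":", 1)
    input_len, output_len = in_out_dims.split(",", 1)
    return int(batch_size), int(input_len), int(output_len)


in_out_sizes = ["64:128,128", "64:128,2048", "64:2048,128", "32:2048,2048"]
for spec in in_out_sizes:
    batch_size, input_len, output_len = parse_benchmark_spec(spec)
    # Mirrors the flag values passed to gptSessionBenchmark in the shell loop.
    print(f"--batch_size {batch_size} --input_output_len {input_len},{output_len}")
```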
##### First Token Latency Benchmark
#### First Token Latency Benchmark

```shell
in_out_sizes=("64:128,1" "32:2048,1")
@@ -372,7 +375,7 @@ python examples/llama/build.py \
--multiple_of 4096
```

##### Throughput Benchmark
#### Throughput Benchmark

```shell
in_out_sizes=("64:128,128" "64:128,2048" "64:2048,128" "64:2048,2048")
@@ -386,7 +389,7 @@ do
done
```

##### First Token Latency Benchmark
#### First Token Latency Benchmark

```shell
in_out_sizes=("64:128,1" "64:128,1")
@@ -402,6 +405,7 @@ done


### Falcon-180B

---

Benchmarking Falcon-180B requires a custom engine per batch size, input/output sequence length due
42 changes: 42 additions & 0 deletions examples/gpt/README.md
@@ -535,3 +535,45 @@ python3 build.py --model_dir=./c-model/gpt2/2-gpu --dtype bfloat16 --world_size=

mpirun -np 2 python3 ../summarize.py --engine_dir trt_engine/gpt2/bfloat16/2-gpu --hf_model_dir gpt2 --batch_size 10 --test_trt_llm --check_accuracy --tensorrt_llm_rouge1_threshold=14 --dataset_path ./dataset --no_add_special_tokens
```

### Run LoRA with the Nemo checkpoint

```bash
git clone https://huggingface.co/nvidia/GPT-2B-001
python3 nemo_ckpt_convert.py -i GPT-2B-001/GPT-2B-001_bf16_tp1.nemo -o /tmp/c-model/gpt-next-2B --tensor-parallelism 1 --storage-type bfloat16

python3 build.py --model_dir=/tmp/c-model/gpt-next-2B/1-gpu/ \
--dtype bfloat16 \
--remove_input_padding \
--use_gpt_attention_plugin \
--output_dir /tmp/gpt-next-2B/ \
--use_lora_plugin \
--max_batch_size 4 \
--max_input_len 512 \
--max_output_len 50 \
--lora_target_modules "attn_qkv"

python3 nemo_lora_convert.py -i tmp_nemo_ckpt/gpt2b_lora-900.nemo -o /tmp/gpt-next-2B/ -t bf16 # Assume lora weights are in tmp_nemo_ckpt/gpt2b_lora-900.nemo

python3 ../run.py --max_output_len=20 \
--vocab_file=/tmp/c-model/gpt-next-2B/1-gpu/tokenizer.model \
--engine_dir /tmp/gpt-next-2B/ \
--lora_dir /tmp/gpt-next-2B/ \
--lora_task_uids "lora" \
--no_add_special_tokens \
--input_text "After Washington had returned to Williamsburg, Dinwiddie ordered him to lead a larger force to assist Trent in his work. While en route, Washington learned of Trent's retreat. Since Tanaghrisson had promised support to the British, Washington continued toward Fort Duquesne and met with the Mingo leader. Learning of a French scouting party in the area, Washington, with Tanaghrisson and his party, surprised the Canadians on May 28 in what became known as the Battle of Jumonville Glen. They killed many of the Canadians, including their commanding officer, Joseph Coulon de Jumonville, whose head was reportedly split open by Tanaghrisson with a tomahawk. The historian Fred Anderson suggests that Tanaghrisson was acting to gain the support of the British and regain authority over his own people. They had been inclined to support the French, with whom they had long trading relationships. One of Tanaghrisson's men told Contrecoeur that Jumonville had been killed by British musket fire. Question: Upon learning of a French scounting party in the area, what did Washington do? Answer:"
```

Users who want to skip the LoRA module can pass a task uid of -1 with
`--lora_task_uids -1`. In that case, the model does not run the LoRA module,
so the results will differ from the LoRA-enabled run above.

```bash
python3 ../run.py --max_output_len=20 \
--vocab_file=/tmp/c-model/gpt-next-2B/1-gpu/tokenizer.model \
--engine_dir /tmp/gpt-next-2B/ \
--lora_dir /tmp/gpt-next-2B/ \
--lora_task_uids "-1" \
--no_add_special_tokens \
--input_text "After Washington had returned to Williamsburg, Dinwiddie ordered him to lead a larger force to assist Trent in his work. While en route, Washington learned of Trent's retreat. Since Tanaghrisson had promised support to the British, Washington continued toward Fort Duquesne and met with the Mingo leader. Learning of a French scouting party in the area, Washington, with Tanaghrisson and his party, surprised the Canadians on May 28 in what became known as the Battle of Jumonville Glen. They killed many of the Canadians, including their commanding officer, Joseph Coulon de Jumonville, whose head was reportedly split open by Tanaghrisson with a tomahawk. The historian Fred Anderson suggests that Tanaghrisson was acting to gain the support of the British and regain authority over his own people. They had been inclined to support the French, with whom they had long trading relationships. One of Tanaghrisson's men told Contrecoeur that Jumonville had been killed by British musket fire. Question: Upon learning of a French scounting party in the area, what did Washington do? Answer:"
```
