TensorRT 10.7-GA OSS Release (#4269)
Signed-off-by: Kevin Chen <[email protected]>
kevinch-nv authored Dec 5, 2024
1 parent c468d67 commit 17003e4
Showing 81 changed files with 1,411 additions and 530 deletions.
2 changes: 1 addition & 1 deletion .gitmodules
@@ -9,4 +9,4 @@
[submodule "parsers/onnx"]
path = parsers/onnx
url = https://github.com/onnx/onnx-tensorrt.git
branch = release/10.6-GA
branch = release/10.7-GA
25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,30 @@
# TensorRT OSS Release Changelog

## 10.7.0 GA - 2024-12-04
Key Features and Updates:

- Demo Changes
  - demoDiffusion
    - Enabled low-VRAM mode for the Flux pipeline. Users can now run the pipeline on systems with 32GB of VRAM.
    - Added support for the [FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) pipeline.
    - Enabled weight streaming mode for the Flux pipeline.

- Plugin Changes
  - On Blackwell and later platforms, TensorRT will drop cuDNN support on the following categories of plugins:
    - User-written `IPluginV2Ext`, `IPluginV2DynamicExt`, and `IPluginV2IOExt` plugins that are dependent on cuDNN handles provided by TensorRT (via the `attachToContext()` API).
    - TensorRT standard plugins that use cuDNN, specifically:
      - `InstanceNormalization_TRT` (version: 1, 2, and 3) present in `plugin/instanceNormalizationPlugin/`.
      - `GroupNormalizationPlugin` (version: 1) present in `plugin/groupNormalizationPlugin/`.
    - Note: These normalization plugins are superseded by TensorRT’s native `INormalizationLayer` ([C++](https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_normalization_layer.html), [Python](https://docs.nvidia.com/deeplearning/tensorrt/operators/docs/Normalization.html)). TensorRT support for cuDNN-dependent plugins remains unchanged on pre-Blackwell platforms. A minimal migration sketch follows below.
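For illustration, a minimal sketch of what the migration to the native layer can look like with the TensorRT Python builder API is shown below; the input shape, per-channel scale/bias constants, and epsilon are assumptions made for this example, not values taken from the release.

```python
import numpy as np
import tensorrt as trt

# Sketch: per-channel instance normalization built with the native
# INormalizationLayer instead of the InstanceNormalization_TRT plugin.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit-batch network

# Hypothetical NCHW input with 32 channels; scale and bias are per-channel constants.
x = network.add_input("x", trt.float32, (1, 32, 64, 64))
scale = network.add_constant((1, 32, 1, 1), np.ones((1, 32, 1, 1), dtype=np.float32)).get_output(0)
bias = network.add_constant((1, 32, 1, 1), np.zeros((1, 32, 1, 1), dtype=np.float32)).get_output(0)

# Instance normalization reduces over the spatial axes (H = axis 2, W = axis 3).
axes_mask = (1 << 2) | (1 << 3)
norm = network.add_normalization(x, scale, bias, axes_mask)
norm.epsilon = 1e-5  # assumed epsilon
network.mark_output(norm.get_output(0))
```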

- Parser Changes
  - Now prioritizes using plugins over local functions when a corresponding plugin is available in the registry.
  - Added dynamic axes support for `Squeeze` and `Unsqueeze` operations (see the sketch after this list).
  - Added support for parsing mixed-precision `BatchNormalization` nodes in strongly-typed mode.
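To illustrate the dynamic-axes case, the hypothetical ONNX graph below feeds the `axes` input of an `Unsqueeze` node as a runtime tensor (a graph input) rather than a constant initializer; the tensor names, shapes, and opset are assumptions made for the sketch.

```python
import onnx
from onnx import TensorProto, helper

# Hypothetical model: Unsqueeze whose `axes` is supplied at runtime
# instead of being baked into the graph as an initializer.
x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [3, 4])
axes = helper.make_tensor_value_info("axes", TensorProto.INT64, [1])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, None)

node = helper.make_node("Unsqueeze", inputs=["x", "axes"], outputs=["y"])
graph = helper.make_graph([node], "dynamic_unsqueeze", [x, axes], [y])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "dynamic_unsqueeze.onnx")  # can then be fed to the ONNX parser / trtexec
```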

- Addressed Issues
  - Fixed [4113](https://github.com/NVIDIA/TensorRT/issues/4113).

## 10.6.0 GA - 2024-11-05
Key Features and Updates:
- Demo Changes
18 changes: 9 additions & 9 deletions README.md
@@ -26,7 +26,7 @@ You can skip the **Build** section to enjoy TensorRT with Python.
To build the TensorRT-OSS components, you will first need the following software packages.

**TensorRT GA build**
* TensorRT v10.6.0.26
* TensorRT v10.7.0.23
* Available from direct download links listed below

**System Packages**
@@ -73,25 +73,25 @@ To build the TensorRT-OSS components, you will first need the following software
If using the TensorRT OSS build container, TensorRT libraries are preinstalled under `/usr/lib/x86_64-linux-gnu` and you may skip this step.

Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below:
- [TensorRT 10.6.0.26 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/tars/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-11.8.tar.gz)
- [TensorRT 10.6.0.26 for CUDA 12.6, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/tars/TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz)
- [TensorRT 10.6.0.26 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/zip/TensorRT-10.6.0.26.Windows.win10.cuda-11.8.zip)
- [TensorRT 10.6.0.26 for CUDA 12.6, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.6.0/zip/TensorRT-10.6.0.26.Windows.win10.cuda-12.6.zip)
- [TensorRT 10.7.0.23 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/tars/TensorRT-10.7.0.23.Linux.x86_64-gnu.cuda-11.8.tar.gz)
- [TensorRT 10.7.0.23 for CUDA 12.6, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/tars/TensorRT-10.7.0.23.Linux.x86_64-gnu.cuda-12.6.tar.gz)
- [TensorRT 10.7.0.23 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/zip/TensorRT-10.7.0.23.Windows.win10.cuda-11.8.zip)
- [TensorRT 10.7.0.23 for CUDA 12.6, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.7.0/zip/TensorRT-10.7.0.23.Windows.win10.cuda-12.6.zip)


**Example: Ubuntu 20.04 on x86-64 with cuda-12.6**

```bash
cd ~/Downloads
tar -xvzf TensorRT-10.6.0.26.Linux.x86_64-gnu.cuda-12.6.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-10.6.0.26
tar -xvzf TensorRT-10.7.0.23.Linux.x86_64-gnu.cuda-12.6.tar.gz
export TRT_LIBPATH=`pwd`/TensorRT-10.7.0.23
```

**Example: Windows on x86-64 with cuda-12.6**

```powershell
Expand-Archive -Path TensorRT-10.6.0.26.Windows.win10.cuda-12.6.zip
$env:TRT_LIBPATH="$pwd\TensorRT-10.6.0.26\lib"
Expand-Archive -Path TensorRT-10.7.0.23.Windows.win10.cuda-12.6.zip
$env:TRT_LIBPATH="$pwd\TensorRT-10.7.0.23\lib"
```

## Setting Up The Build Environment
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
10.6.0.26
10.7.0.23
2 changes: 1 addition & 1 deletion demo/BERT/README.md
@@ -75,7 +75,7 @@ The following software version configuration has been tested:
|Software|Version|
|--------|-------|
|Python|>=3.8|
|TensorRT|10.6.0.26|
|TensorRT|10.7.0.23|
|CUDA|12.6|

## Setup
35 changes: 28 additions & 7 deletions demo/Diffusion/README.md
@@ -7,7 +7,7 @@ This demo application ("demoDiffusion") showcases the acceleration of Stable Dif
### Clone the TensorRT OSS repository

```bash
git clone [email protected]:NVIDIA/TensorRT.git -b release/10.5 --single-branch
git clone [email protected]:NVIDIA/TensorRT.git -b release/10.7 --single-branch
cd TensorRT
```

@@ -16,7 +16,7 @@ cd TensorRT
Install nvidia-docker using [these instructions](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).

```bash
docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:24.07-py3 /bin/bash
docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:24.10-py3 /bin/bash
```

NOTE: The demo supports CUDA>=11.8
@@ -43,12 +43,12 @@ pip3 install -r requirements.txt

> NOTE: demoDiffusion has been tested on systems with NVIDIA H100, A100, L40, T4, and RTX4090 GPUs, and the following software configuration.
```
diffusers 0.30.2
diffusers 0.31.0
onnx 1.15.0
onnx-graphsurgeon 0.5.2
onnxruntime 1.16.3
polygraphy 0.49.9
tensorrt 10.6.0.26
tensorrt 10.7.0.23
tokenizers 0.13.3
torch 2.2.0
transformers 4.42.2
@@ -66,6 +66,7 @@ python3 demo_img2img.py --help
python3 demo_inpaint.py --help
python3 demo_controlnet.py --help
python3 demo_txt2img_xl.py --help
python3 demo_txt2img_flux.py --help
```

### HuggingFace user access token
@@ -257,23 +258,43 @@ python3 demo_stable_cascade.py --onnx-opset=16 "Anthropomorphic cat dressed as a
### Generate an image guided by a text prompt using Flux

Run the below command to generate an image with FLUX.1 Dev in FP16.

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN
```

Run the below command to generate an image with FLUX in BF16.
Run the below command to generate an image with FLUX.1 Dev in BF16.

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --bf16
```

Run the below command to generate an image with FLUX in FP8. (FP8 is only supppoted on Hopper.)
Run the below command to generate an image with FLUX.1 Dev in FP8. (FP8 is supported on Hopper and Ada.)

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --fp8
```

NOTE: Running the Flux pipeline requires 80GB of GPU memory or higher
Run the below command to generate an image with FLUX.1 Schnell in FP16.

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell"
```

Run the below command to generate an image with FLUX.1 Schnell in BF16.

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell" --bf16
```

Run the below command to generate an image with FLUX.1 Schnell in FP8. (FP8 is supported on Hopper and Ada.)

```bash
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell" --fp8
```

NOTE: The FLUX.1 Dev pipeline requires at least 48GB of GPU memory, and the FLUX.1 Schnell pipeline requires at least 24GB of GPU memory.

## Configuration options
- Noise scheduler can be set using `--scheduler <scheduler>`. Note: not all schedulers are available for every version.
75 changes: 63 additions & 12 deletions demo/Diffusion/demo_txt2img_flux.py
@@ -20,7 +20,12 @@
from cuda import cudart

from flux_pipeline import FluxPipeline
from utilities import PIPELINE_TYPE, add_arguments, process_pipeline_args
from utilities import (
PIPELINE_TYPE,
add_arguments,
process_pipeline_args,
VALID_OPTIMIZATION_LEVELS,
)


def parse_args():
@@ -32,7 +37,7 @@ def parse_args():
"--version",
type=str,
default="flux.1-dev",
choices=["flux.1-dev"],
choices=("flux.1-dev", "flux.1-schnell"),
help="Version of Flux",
)
parser.add_argument(
@@ -65,20 +70,48 @@
parser.add_argument(
"--max_sequence_length",
type=int,
default=512,
help="Maximum sequence length to use with the prompt",
help="Maximum sequence length to use with the prompt. Can be up to 512 for the dev and 256 for the schnell variant.",
)
parser.add_argument(
"--bf16",
action='store_true',
help="Run pipeline in BFloat16 precision"
"--bf16", action="store_true", help="Run pipeline in BFloat16 precision"
)
parser.add_argument(
"--low-vram",
action="store_true",
help="Optimize for low VRAM usage, possibly at the expense of inference performance. Disabled by default.",
)
parser.add_argument(
"--optimization-level",
type=int,
default=3,
help=f"Set the builder optimization level to build the engine with. A higher level allows TensorRT to spend more building time for more optimization options. Must be one of {VALID_OPTIMIZATION_LEVELS}.",
)
parser.add_argument(
"--torch-fallback",
default=None,
type=str,
help="Name list of models to be inferenced using torch instead of TRT. For example --torch-fallback t5,transformer. If --torch-inference set, this parameter will be ignored."
)

parser.add_argument(
"--ws",
action='store_true',
help="Optimize for low VRAM usage, possibly at the expense of inference performance. Disabled by default."
help="Build TensorRT engines with weight streaming enabled."
)

parser.add_argument(
"--t5-ws-percentage",
type=int,
default=None,
help="Set runtime weight streaming budget as the percentage of the size of streamable weights for the T5 model. This argument only takes effect when --ws is set. 0 streams the most weights and 100 or None streams no weights. "
)

parser.add_argument(
"--transformer-ws-percentage",
type=int,
default=None,
help="Set runtime weight streaming budget as the percentage of the size of streamable weights for the transformer model. This argument only takes effect when --ws is set. 0 streams the most weights and 100 or None streams no weights."
)
return parser.parse_args()


@@ -100,10 +133,24 @@ def process_demo_args(args):
if len(prompt2) == 1:
prompt2 = prompt2 * batch_size

if args.max_sequence_length is not None and args.max_sequence_length > 512:
raise ValueError(
f"`max_sequence_length` cannot be greater than 512 but is {args.max_sequence_length}"
)
max_seq_supported_by_model = {
"flux.1-schnell": 256,
"flux.1-dev": 512,
}[args.version]
if args.max_sequence_length is not None:
if args.max_sequence_length > max_seq_supported_by_model:
raise ValueError(
f"For {args.version}, `max_sequence_length` cannot be greater than {max_seq_supported_by_model} but is {args.max_sequence_length}"
)
else:
args.max_sequence_length = max_seq_supported_by_model

if args.torch_fallback and not args.torch_inference:
args.torch_fallback = args.torch_fallback.split(",")

if args.torch_fallback and args.torch_inference:
print(f"[W] All models will run in PyTorch when --torch-inference is set. Parameter --torch-fallback will be ignored.")
args.torch_fallback = None

args_run_demo = (
prompt,
@@ -131,6 +178,10 @@ def process_demo_args(args):
max_sequence_length=args.max_sequence_length,
bf16=args.bf16,
low_vram=args.low_vram,
torch_fallback=args.torch_fallback,
weight_streaming=args.ws,
t5_weight_streaming_budget_percentage=args.t5_ws_percentage,
transformer_weight_streaming_budget_percentage=args.transformer_ws_percentage,
**kwargs_init_pipeline)

# Load TensorRT engines and pytorch modules