
Benchmark VISTA3D #671

Draft · wants to merge 2 commits into base: dev

Conversation

@binliunls (Contributor) commented Sep 23, 2024

Description

This PR benchmarks, analyzes, and optimizes the all-class segmentation inference of the VISTA3D bundle to reduce latency. I will add the benchmark results and analyses in the PR comments, while the general conclusions will be kept up to date here in the PR description.

The MONAI core code also needs to be updated according to this PR.

Status

Work in progress

Conclusion

  1. A larger sw_batch_size (around 14) reduces latency on an A100 (80 GB) GPU.

TODO:

  • Add a timing function to analyze the latency of each part of the bundle.
  • Identify the latency bottleneck.
  • Analyze the detailed latency of each part of the VISTA3D network.
  • Compare the TRT and non-TRT models.
  • Optimize the latency with TRT and other methods.

@binliunls (Contributor, Author) commented Sep 23, 2024

Now that the Range function has been added to the bundle, we can look at the latency details of one inference iteration, shown below. All the gray boxes under SW_Patchforward_Loop represent a model computation/prediction call of the VISTA3D network, labeled SW_Model_Computation in the image. Since the bundle uses sliding window inference and the given image size (512x512x77) is larger than the sliding window size (128x128x128), one sliding window inference iteration actually contains several SW_Model_Computation iterations.

image
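For reference, a minimal sketch of how such named spans can be emitted with PyTorch's NVTX API so they show up in an Nsight Systems trace. This is only an illustration of the mechanism; the bundle's actual instrumentation (the Range function above) and its range names may differ.

```python
import torch

def annotated_model_call(model, patch):
    # Open an NVTX range so this model call appears as a named span
    # (e.g. SW_Model_Computation) in the Nsight Systems timeline.
    torch.cuda.nvtx.range_push("SW_Model_Computation")
    with torch.no_grad():
        out = model(patch)
    torch.cuda.nvtx.range_pop()
    return out
```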

Zooming into the red box, we can further analyze the latency of one SW_Model_Computation iteration. As shown in the image below, each SW_Model_Computation executes cudaFree (pink) followed by cuStreamSynchronize (green). These overhead calls, introduced by the loop-based inference, take up a large fraction of SW_Model_Computation.

image

As analyzed above, the overhead function calls are the bottleneck of the current bundle's inference latency. To reduce them, we can increase the sw_batch_size parameter of the sliding window inference; note that this also requires more GPU memory. After increasing sw_batch_size, the latency detail looks like the image below (a minimal usage sketch follows it). The latency improved from 3.53 seconds to 2.36 seconds.

image
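As context for the sw_batch_size parameter, here is a minimal sketch of MONAI's sliding window inference call. The network, input tensor, and the value 14 are illustrative placeholders; the bundle sets these through its inference config.

```python
import torch
from monai.inferers import sliding_window_inference

# Placeholder network and input volume; in the bundle these come from the config.
network = torch.nn.Conv3d(1, 2, kernel_size=1).eval()
image = torch.rand(1, 1, 512, 512, 77)

with torch.no_grad():
    # Each sliding window call batches `sw_batch_size` 128x128x128 patches
    # through the network, so there are fewer per-patch launch/sync overheads
    # per volume, at the cost of more GPU memory.
    pred = sliding_window_inference(
        inputs=image,
        roi_size=(128, 128, 128),
        sw_batch_size=14,
        predictor=network,
        overlap=0.25,
    )
```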

Besides the inference iterations, the second and third largest time consumers are LoadingImage and SaveImage.

image

@binliunls (Contributor, Author) commented Sep 23, 2024

The relation between latency (in seconds) and sw_batch_size is:

| sw_batch_size | 1 | 4 | 8 | 10 | 12 | 14 | 16 | 20 |
|---|---|---|---|---|---|---|---|---|
| spleen_12 (512x512x168) | 2.771 | 2.085 | 2.351 | 2.208 | 1.906 | 2.095 | 1.987 | 1.896 |
| spleen_38 (512x512x100) | 5.280 | 3.887 | 3.535 | 3.689 | 3.691 | 3.592 | 3.838 | 3.547 |
| spleen_10 (512x512x55) | 3.404 | 2.803 | 2.508 | 2.362 | 2.289 | 2.301 | 2.594 | 2.740 |
| spleen_9 (512x512x41) | 2.772 | 2.066 | 2.071 | 2.168 | 1.788 | 2.186 | 2.079 | 1.924 |

image

"_target_": "ToDeviced",
"keys": "pred",
"device": "cpu",
"_disabled_": true
Collaborator:
Since this is disabled, why don't we remove this transform from post?

Contributor (Author):

@heyufan1995 could you please answer this one?

Thanks
Bin

@@ -15,7 +15,8 @@
"output_dtype": "$np.float32",
"output_postfix": "trans",
"separate_folder": true,
"input_dict": "${'image': '/data/Task09_Spleen/imagesTr/spleen_10.nii.gz', 'label_prompt': [3]}",
"sw_batch_size": 10,
@SachidanandAlle (Collaborator) commented Sep 23, 2024:

Can we determine it based on GPU memory size?
torch.cuda.get_device_properties(device).total_memory will give you the total memory. Based on the total memory and the patch size, you can come up with a simple logic to determine whether to use a batch size of 1, 2, 4, 8, or 10.

Contributor (Author):

Hi @SachidanandAlle, I am not sure what "a simple logic to determine if you want to use batch size 1, 2, 4, 8, 10" means here. If the patch size is fixed, then the GPU memory usage is fixed. Do you suggest switching to a smaller batch size if the GPU memory is not enough? Otherwise I think we should always choose 14 as the batch size here.

Thanks
Bin

Collaborator:

Based on the GPU memory size, determine the best batch size you can use; if you have enough GPU memory, then use higher batch sizes. Have you tried running batch size 10 on a 16 GB V100? If batch size 10 works for a typical 512x512x256 image, then we are good with 10 as the default batch size.

Collaborator:

Also, it would be great if we could modify our sw_inferer logic to be a little smart here and determine the required sw_batch_size based on the following three things (a rough sketch of such a heuristic follows this comment):

  1. Current Input Image Size (after all pre-processing etc...)
  2. GPU Memory size (max capacity)
  3. Patch Size (typically recommended by the trained model)

cc: @Nic-Ma
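To make the suggestion above concrete, here is a rough, untested sketch of such a heuristic. The per-voxel memory cost, safety factor, and candidate list are placeholder assumptions that would have to be measured for VISTA3D; this is not part of the bundle or of MONAI's sliding window inferer.

```python
import torch

def suggest_sw_batch_size(patch_size, bytes_per_patch_voxel=4 * 48, safety=0.5,
                          candidates=(20, 16, 14, 12, 10, 8, 4, 1)):
    """Pick the largest candidate sw_batch_size that fits in GPU memory.

    bytes_per_patch_voxel is a rough per-voxel activation cost that would
    need to be measured for the actual network; `safety` leaves headroom
    for the input volume, the weights, and framework overhead.
    """
    total = torch.cuda.get_device_properties(0).total_memory  # max capacity
    budget = total * safety
    voxels = patch_size[0] * patch_size[1] * patch_size[2]
    per_patch = voxels * bytes_per_patch_voxel
    for bs in candidates:
        if bs * per_patch <= budget:
            return bs
    return 1

# Example (requires a CUDA device): suggest_sw_batch_size((128, 128, 128))
```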

@binliunls (Contributor, Author) commented Sep 25, 2024:

Hi @SachidanandAlle,

I've updated the benchmark results with more images. According to the results, the latency reaches its minimum at a specific batch size, and both increasing and decreasing it makes the latency worse. I think this batch size is a trade-off between overhead and computation: when the batch size is small, the overhead wastes too much time, and when the batch size is large, the computation wastes too much time.

I also checked the GPU memory this time. Note that for some images, even batch size 1 exceeds the 16 GB of a V100 16GB GPU; you can check the details in the image shown below. I don't think VISTA3D is a model suited to GPUs with less than 16 GB of memory, and for acceleration purposes we may need even more GPU memory.

image

I am trying to benchmark a (512, 512, 256) image and will update the result later.

@binliunls (Contributor, Author) commented:

I set sw_batch_size to 14 and ran the bundle with TRT inference (only the encoder is compiled, running all-class segmentation). The benchmarks are shown below: the upper one is the latency detail of the original bundle, and the lower one is the TRT bundle.

Hi @borisfom, I didn't see a significant improvement for the encoder; the inference latencies of the TRT and non-TRT versions are nearly the same. Could you please offer some suggestions here? Thanks in advance!

Original bundle
image

TRT bundle
image

@borisfom (Contributor) commented:

@binliunls: well, it does seem to be the case that TRT does not help much with this net. I was not able to run batch=14 on my box, but batch=8 gave a similar result; in fact, batch=1 does not give much improvement either.
It looks like TRT is currently only running a small fraction of the encoder's forward pass, and expanding that may not be straightforward. I will look a bit more at the model, but most likely expanding the converted part won't be possible, as that would make it dependent on dynamic arguments.

@binliunls (Contributor, Author) commented:

> @binliunls: well, it does seem to be the case that TRT does not help much with this net. I was not able to run batch=14 on my box, but batch=8 gave a similar result; in fact, batch=1 does not give much improvement either. It looks like TRT is currently only running a small fraction of the encoder's forward pass, and expanding that may not be straightforward. I will look a bit more at the model, but most likely expanding the converted part won't be possible, as that would make it dependent on dynamic arguments.

TRT does improve the inference on the V100 32GB GPU, where the maximum batch size is 6. Here are the details.

MONAI bundle:
image

TRT bundle:
image

@binliunls (Contributor, Author) commented Sep 26, 2024

Note that there is a memory allocation that is unnecessary for all-class inference, which is related to this line:

image

@borisfom (Contributor) commented:

@binliunls: wow, that's a massive sync, apparently caused by waiting for TRT results. Does removing it actually help, though?

@binliunls (Contributor, Author) commented Sep 27, 2024

> @binliunls: wow, that's a massive sync, apparently caused by waiting for TRT results. Does removing it actually help, though?

Hi @borisfom, this result didn't use TRT. It's just a plain MONAI bundle, because on the A100 they have basically the same performance and MONAI bundles are easier to run. And yes, removing it does help improve latency: removing the cudaMalloc has already saved roughly 200-300 ms. I will try to figure out where these API calls happen in the code and see if we can further improve the performance.

Thanks
Bin

@binliunls (Contributor, Author) commented:

The embedding-mask part of the classification head is another high-latency part that can be optimized, as shown in the image below.

image

It uses a Python for-loop to perform the tensor multiplication, which is inefficient. The code snippet looks like:

```python
b, c, h, w, d = src.shape
masks = []
for i in range(b):
    # one (class_embedding @ features) matmul per batch element
    mask = class_embedding @ src[[i]].view(1, c, h * w * d)
    masks.append(mask.view(-1, 1, h, w, d))
```

We can refactor it into a single batched (broadcast) tensor multiplication like:

```python
b, c, h, w, d = src.shape
# one broadcast matmul over all b batch elements at once
c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
```

Here is a simple test case to verify that the two implementations produce the same result:

```python
import torch

def mat_mul2(class_embedding, src):
    # batched implementation: one broadcast matmul over all batch elements
    b, c, h, w, d = src.shape
    c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
    c = c.view(b, -1, h, w, d)
    return torch.transpose(c, 0, 1)

def mat_mul1(class_embedding, src):
    # original implementation: Python loop, one matmul per batch element
    b, c, h, w, d = src.shape
    masks = []
    for i in range(b):
        mask = class_embedding @ src[[i]].view(1, c, h * w * d)
        masks.append(mask.view(-1, 1, h, w, d))
    return torch.cat(masks, 1)

a = torch.rand((17, 1, 4))        # class embeddings
b = torch.rand(4, 4, 12, 12, 12)  # feature map (b, c, h, w, d)
ans1 = mat_mul1(a, b)
ans2 = mat_mul2(a, b)
assert torch.allclose(ans1, ans2)
```

After replacing the embedding-mask calculation with the new implementation, the whole sliding window inference latency dropped from 2.302 s to 1.830 s, and the embedding-mask part dropped from 278 ms to 4 ms.

image

KumoLiu added a commit to Project-MONAI/MONAI that referenced this pull request Oct 22, 2024
Fixes #8122.

### Description
As shown in [this PR](Project-MONAI/model-zoo#671), the memory malloc and the mask-embedding for-loop are the bottlenecks that caused the slow VISTA3D inference. Therefore, this PR fixes them by adding logic for the malloc and replacing the for-loop with a tensor multiplication.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

Signed-off-by: binliu <[email protected]>
Co-authored-by: Yiheng Wang <[email protected]>
Co-authored-by: YunLiu <[email protected]>