
Benchmark VISTA3D #671

Draft · wants to merge 2 commits into base: dev

Conversation

@binliunls (Contributor) commented Sep 23, 2024

Description

This PR benchmarks, analyzes, and optimizes the all-class segmentation inference of the VISTA3D bundle to reduce latency. I will add the benchmark results and analyses in the PR comments, while the general conclusions will be kept up to date here in the PR description.

The MONAI core code also needs to be updated according to this PR.

Status

Work in progress

Conclusion

  1. A larger sw_batch_size (around 14) reduces latency on an A100 (80 GB) GPU.

TODO:

  • Add a timing function to analyze the latency of each part of the bundle.
  • Identify the latency bottleneck.
  • Analyze the detailed latency of each part of the VISTA3D network.
  • Compare the TRT and non-TRT models.
  • Optimize the latency with TRT and other methods.

@binliunls (Contributor, Author) commented Sep 23, 2024

Now that the Range function has been added to the bundle, we can look at the latency details of one inference iteration, shown below. All the gray boxes under SW_Patchforward_Loop represent a model computation/prediction call of the VISTA3D network, labeled SW_Model_Computation in the image. Since the bundle uses sliding window inference and the given image size (512x512x77) is larger than the sliding window size (128x128x128), one sliding window inference iteration actually contains several SW_Model_Computation iterations.

image
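For reference, a minimal sketch of how such named spans can be emitted with PyTorch's NVTX API so they show up in an Nsight Systems trace. This is only an illustration of the mechanism; the bundle's actual instrumentation (the Range function above) and its range names may differ.

```python
import torch

def annotated_model_call(model, patch):
    # Open an NVTX range so this model call appears as a named span
    # (e.g. SW_Model_Computation) in the Nsight Systems timeline.
    torch.cuda.nvtx.range_push("SW_Model_Computation")
    with torch.no_grad():
        out = model(patch)
    torch.cuda.nvtx.range_pop()
    return out
```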

Zooming into the red box, we can further analyze the latency of one SW_Model_Computation iteration. As shown in the image below, each SW_Model_Computation executes cudaFree (pink) followed by cuStreamSynchronize (green). These overhead calls, introduced by the loop-based inference, take up a large fraction of SW_Model_Computation.

image

As analyzed above, the overhead function calls are the bottleneck of the current bundle's inference latency. To reduce them, we can increase the sw_batch_size parameter of the sliding window inference; note that this also requires more GPU memory. After increasing sw_batch_size, the latency detail looks like the image below (a minimal usage sketch follows it). The latency improved from 3.53 seconds to 2.36 seconds.

image
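As context for the sw_batch_size parameter, here is a minimal sketch of MONAI's sliding window inference call. The network, input tensor, and the value 14 are illustrative placeholders; the bundle sets these through its inference config.

```python
import torch
from monai.inferers import sliding_window_inference

# Placeholder network and input volume; in the bundle these come from the config.
network = torch.nn.Conv3d(1, 2, kernel_size=1).eval()
image = torch.rand(1, 1, 512, 512, 77)

with torch.no_grad():
    # Each sliding window call batches `sw_batch_size` 128x128x128 patches
    # through the network, so there are fewer per-patch launch/sync overheads
    # per volume, at the cost of more GPU memory.
    pred = sliding_window_inference(
        inputs=image,
        roi_size=(128, 128, 128),
        sw_batch_size=14,
        predictor=network,
        overlap=0.25,
    )
```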

Besides the inference iterations, the second and third largest time consumers are LoadingImage and SaveImage.

image

@binliunls (Contributor, Author) commented Sep 23, 2024

The relation between latency (in seconds) and sw_batch_size is:

| sw_batch_size | 1 | 4 | 8 | 10 | 12 | 14 | 16 | 20 |
|---|---|---|---|---|---|---|---|---|
| spleen_12 (512x512x168) | 2.771 | 2.085 | 2.351 | 2.208 | 1.906 | 2.095 | 1.987 | 1.896 |
| spleen_38 (512x512x100) | 5.280 | 3.887 | 3.535 | 3.689 | 3.691 | 3.592 | 3.838 | 3.547 |
| spleen_10 (512x512x55) | 3.404 | 2.803 | 2.508 | 2.362 | 2.289 | 2.301 | 2.594 | 2.740 |
| spleen_9 (512x512x41) | 2.772 | 2.066 | 2.071 | 2.168 | 1.788 | 2.186 | 2.079 | 1.924 |

image

"_target_": "ToDeviced",
"keys": "pred",
"device": "cpu",
"_disabled_": true
Collaborator:
Since this is disabled, why don't we remove this transform from post?

Contributor (Author):

@heyufan1995 could you please answer this one?

Thanks
Bin

@@ -15,7 +15,8 @@
"output_dtype": "$np.float32",
"output_postfix": "trans",
"separate_folder": true,
"input_dict": "${'image': '/data/Task09_Spleen/imagesTr/spleen_10.nii.gz', 'label_prompt': [3]}",
"sw_batch_size": 10,
@SachidanandAlle (Collaborator) commented Sep 23, 2024:

Can we determine it based on GPU memory size?
torch.cuda.get_device_properties(device).total_memory will give you the total memory. Based on the total memory and the patch size, you can come up with a simple logic to determine whether to use a batch size of 1, 2, 4, 8, or 10.

Contributor (Author):

Hi @SachidanandAlle, I am not sure what "a simple logic to determine if you want to use batch size 1, 2, 4, 8, 10" means here. If the patch size is fixed, then the GPU memory usage is fixed. Do you suggest switching to a smaller batch size if the GPU memory is not enough? Otherwise I think we should always choose 14 as the batch size here.

Thanks
Bin

Collaborator:

Based on the GPU memory size, determine the best batch size you can use; if you have enough GPU memory, then use higher batch sizes. Have you tried running batch size 10 on a 16 GB V100? If batch size 10 works for a typical 512x512x256 image, then we are good with 10 as the default batch size.

Collaborator:

Also, it would be great if we could modify our sw_inferer logic to be a little smart here and determine the required sw_batch_size based on the following three things (a rough sketch of such a heuristic follows this comment):

  1. Current Input Image Size (after all pre-processing etc...)
  2. GPU Memory size (max capacity)
  3. Patch Size (typically recommended by the trained model)

cc: @Nic-Ma
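To make the suggestion above concrete, here is a rough, untested sketch of such a heuristic. The per-voxel memory cost, safety factor, and candidate list are placeholder assumptions that would have to be measured for VISTA3D; this is not part of the bundle or of MONAI's sliding window inferer.

```python
import torch

def suggest_sw_batch_size(patch_size, bytes_per_patch_voxel=4 * 48, safety=0.5,
                          candidates=(20, 16, 14, 12, 10, 8, 4, 1)):
    """Pick the largest candidate sw_batch_size that fits in GPU memory.

    bytes_per_patch_voxel is a rough per-voxel activation cost that would
    need to be measured for the actual network; `safety` leaves headroom
    for the input volume, the weights, and framework overhead.
    """
    total = torch.cuda.get_device_properties(0).total_memory  # max capacity
    budget = total * safety
    voxels = patch_size[0] * patch_size[1] * patch_size[2]
    per_patch = voxels * bytes_per_patch_voxel
    for bs in candidates:
        if bs * per_patch <= budget:
            return bs
    return 1

# Example (requires a CUDA device): suggest_sw_batch_size((128, 128, 128))
```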

@binliunls (Contributor, Author) commented Sep 25, 2024:

Hi @SachidanandAlle,

I've updated the benchmark results with more images. According to the results, the latency reaches its minimum at a specific batch size, and both increasing and decreasing it makes the latency worse. I think this batch size is a trade-off between overhead and computation: when the batch size is small, the overhead wastes too much time, and when the batch size is large, the computation wastes too much time.

I also checked the GPU memory this time. Note that for some images, even batch size 1 exceeds the 16 GB of a V100 16GB GPU; you can check the details in the image shown below. I don't think VISTA3D is a model suited to GPUs with less than 16 GB of memory, and for acceleration purposes we may need even more GPU memory.

image

I am trying to benchmark a (512, 512, 256) image and will update the result later.

@binliunls (Contributor, Author) commented:

I set sw_batch_size to 14 and ran the bundle with TRT inference (only the encoder is compiled, running all-class segmentation). The benchmarks are shown below: the upper one is the latency detail of the original bundle, and the lower one is the TRT bundle.

Hi @borisfom, I didn't see a significant improvement for the encoder; the inference latencies of the TRT and non-TRT versions are nearly the same. Could you please offer some suggestions here? Thanks in advance!

Original bundle
image

TRT bundle
image

@borisfom (Contributor) commented:

@binliunls: well, it does seem to be the case that TRT does not help much with this net. I was not able to run batch=14 on my box, but batch=8 gave a similar result; in fact, batch=1 does not give much improvement either.
It looks like TRT is currently only running a small fraction of the encoder's forward pass, and expanding that may not be straightforward. I will look a bit more at the model, but most likely expanding the converted part won't be possible, as that would make it dependent on dynamic arguments.

@binliunls (Contributor, Author) commented:

> @binliunls: well, it does seem to be the case that TRT does not help much with this net. I was not able to run batch=14 on my box, but batch=8 gave a similar result; in fact, batch=1 does not give much improvement either. It looks like TRT is currently only running a small fraction of the encoder's forward pass, and expanding that may not be straightforward. I will look a bit more at the model, but most likely expanding the converted part won't be possible, as that would make it dependent on dynamic arguments.

TRT does improve the inference on the V100 32GB GPU, where the maximum batch size is 6. Here are the details.

MONAI bundle:
image

TRT bundle:
image

@binliunls (Contributor, Author) commented Sep 26, 2024

Note that there is a memory allocation that is unnecessary for all-class inference, which is related to this line:

image

@borisfom (Contributor) commented:

@binliunls: wow, that's a massive sync, apparently caused by waiting for TRT results. Does removing it actually help, though?

@binliunls (Contributor, Author) commented Sep 27, 2024

> @binliunls: wow, that's a massive sync, apparently caused by waiting for TRT results. Does removing it actually help, though?

Hi @borisfom, this result didn't use TRT. It's just a plain MONAI bundle, because on the A100 they have basically the same performance and MONAI bundles are easier to run. And yes, removing it does help improve latency: removing the cudaMalloc has already saved roughly 200-300 ms. I will try to figure out where these API calls happen in the code and see if we can further improve the performance.

Thanks
Bin

@binliunls (Contributor, Author) commented:

The embedding-mask part of the classification head is another high-latency part that can be optimized, as shown in the image below.

image

It uses a Python for-loop to perform the tensor multiplication, which is inefficient. The code snippet looks like:

```python
b, c, h, w, d = src.shape
masks = []
for i in range(b):
    # one (class_embedding @ features) matmul per batch element
    mask = class_embedding @ src[[i]].view(1, c, h * w * d)
    masks.append(mask.view(-1, 1, h, w, d))
```

We can refactor it into a single batched (broadcast) tensor multiplication like:

```python
b, c, h, w, d = src.shape
# one broadcast matmul over all b batch elements at once
c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
```

Here is a simple test case to verify that the two implementations produce the same result:

```python
import torch

def mat_mul2(class_embedding, src):
    # batched implementation: one broadcast matmul over all batch elements
    b, c, h, w, d = src.shape
    c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
    c = c.view(b, -1, h, w, d)
    return torch.transpose(c, 0, 1)

def mat_mul1(class_embedding, src):
    # original implementation: Python loop, one matmul per batch element
    b, c, h, w, d = src.shape
    masks = []
    for i in range(b):
        mask = class_embedding @ src[[i]].view(1, c, h * w * d)
        masks.append(mask.view(-1, 1, h, w, d))
    return torch.cat(masks, 1)

a = torch.rand((17, 1, 4))        # class embeddings
b = torch.rand(4, 4, 12, 12, 12)  # feature map (b, c, h, w, d)
ans1 = mat_mul1(a, b)
ans2 = mat_mul2(a, b)
assert torch.allclose(ans1, ans2)
```

After replacing the embedding-mask calculation with the new implementation, the whole sliding window inference latency dropped from 2.302 s to 1.830 s, and the embedding-mask part dropped from 278 ms to 4 ms.

image

KumoLiu added a commit to Project-MONAI/MONAI that referenced this pull request Oct 22, 2024
Fixes #8122.

### Description
As shown in [this PR](Project-MONAI/model-zoo#671), the memory malloc and the mask-embedding for-loop are the bottlenecks that caused the slow VISTA3D inference. Therefore, this PR fixes them by adding logic for the malloc and replacing the for-loop with a tensor multiplication.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

Signed-off-by: binliu <[email protected]>
Co-authored-by: Yiheng Wang <[email protected]>
Co-authored-by: YunLiu <[email protected]>