Benchmark VISTA3D #671
base: dev
Conversation
By zooming into the red box, we can further analyze the latency of one inference iteration. As we analyzed above, the overhead function calls are the bottleneck of the current bundle's inference latency, so reducing them is where we can improve. In addition to the inference iterations, the second and third largest time consumers are shown in the profiling screenshots.
The relation between latency and sw_batch_size is:
"_target_": "ToDeviced", | ||
"keys": "pred", | ||
"device": "cpu", | ||
"_disabled_": true |
Since this is disabled, why don't we remove this transform from the post-processing?
@heyufan1995 could you please answer this one?
Thanks
Bin
@@ -15,7 +15,8 @@
"output_dtype": "$np.float32",
"output_postfix": "trans",
"separate_folder": true,
"input_dict": "${'image': '/data/Task09_Spleen/imagesTr/spleen_10.nii.gz', 'label_prompt': [3]}",
"sw_batch_size": 10,
Can we determine it based on the GPU memory size?
`torch.cuda.get_device_properties(device).total_memory` will give you the total memory. Based on the total memory and the patch size, you can come up with simple logic to determine whether to use batch size 1, 2, 4, 8, or 10.
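A minimal sketch of that kind of check might look like the following; the `bytes_per_voxel` estimate and the 0.8 headroom factor are made-up placeholders that would need to be calibrated by profiling the actual model:

```python
import math
import torch

def pick_sw_batch_size(patch_size=(128, 128, 128), bytes_per_voxel=512, device=0):
    """Pick the largest candidate sw_batch_size that fits in GPU memory.

    bytes_per_voxel is a hypothetical estimate of peak activation memory per
    input voxel; it should be measured for the real network.
    """
    total = torch.cuda.get_device_properties(device).total_memory
    usable = 0.8 * total  # leave headroom for weights, inputs and the CUDA context
    per_patch = math.prod(patch_size) * bytes_per_voxel
    for candidate in (10, 8, 4, 2, 1):
        if candidate * per_patch <= usable:
            return candidate
    return 1
```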
Hi @SachidanandAlle, I am not sure what "a simple logic to determine if you want to use batch size 1, 2, 4, 8, 10"
means here. If the patch size is fixed, then the GPU memory usage is fixed. Do you suggest switching to a smaller batch size when the GPU memory is not enough? Otherwise, I think we should always choose 14 as the batch size here.
Thanks
Bin
Based on the GPU memory size, determine the best batch size you can use; if you have enough GPU memory, use larger batch sizes. Have you tried running batch size 10 on a 16 GB V100? If batch size 10 works for a typical 512x512x256 image, then we are good with 10 as the default batch size.
Also, it would be great if we could modify our sw_inferer logic to determine the required sw_batch_size (we can be a little smart here) based on the following three things (see the sketch after this list):
- Current Input Image Size (after all pre-processing etc...)
- GPU Memory size (max capacity)
- Patch Size (typically recommended by the trained model)
cc: @Nic-Ma
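Under made-up constants (the per-patch memory and headroom values below would have to be measured, and are not taken from the VISTA3D bundle), a sketch combining those three inputs could cap the batch size both by what fits in memory and by how many windows the image actually produces:

```python
import math
import torch

def suggest_sw_batch_size(image_size, patch_size, overlap=0.25,
                          per_patch_bytes=2 * 1024**3, headroom=0.8, device=0):
    """Suggest sw_batch_size from the input image size, GPU capacity and patch size."""
    # approximate number of sliding windows produced for this image
    num_windows = 1
    for img_d, patch_d in zip(image_size, patch_size):
        step = max(int(patch_d * (1.0 - overlap)), 1)
        num_windows *= math.ceil(max(img_d - patch_d, 0) / step) + 1

    # largest batch that fits in memory, never larger than the window count
    total = torch.cuda.get_device_properties(device).total_memory
    fits_in_memory = max(int(headroom * total // per_patch_bytes), 1)
    return min(fits_in_memory, num_windows)
```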
Hi @SachidanandAlle,
I've updated the benchmark results with more images. According to some of the results, the latency reaches its best value at a specific batch size; both increasing and decreasing it makes the latency worse. I think this batch size is a tradeoff between overhead and computation: when the batch size is small, the overhead wastes too much time, and when the batch size is large, the computation wastes too much time.
I also checked the GPU memory this time. I noticed that for some images even batch size 1 exceeds 16 GB on a V100 16GB GPU; you can check the details in the image shown below. I don't think VISTA3D is a model suited for GPUs with less than 16 GB of memory, and for acceleration purposes we may need even more GPU memory.
I am trying to benchmark a (512, 512, 256) image. Will update the result later.
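As a side note, per-image peak memory numbers like the ones above can be collected with PyTorch's built-in counters; this is only a generic sketch, and the `model` and `image` names are placeholders rather than the bundle's actual objects:

```python
import torch

# Reset the peak-memory counter, run one inference, then read the high-water mark.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    _ = model(image)  # placeholder for one sliding-window inference call
torch.cuda.synchronize()
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```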
Hi @borisfom, I didn't see a significant improvement for the encoder. The inference latencies with and without TRT are nearly the same. Could you please offer some suggestions here? Thanks in advance!
@binliunls: well, that seems to be the case; TRT does not help much with this net. I was not able to run batch=14 on my box, but batch=8 gave a similar result. In fact, batch=1 does not give much improvement either.
This does improve the inference on the V100 32GB GPU with a max batch size of 6. Here are the details.
Note that there is a memory malloc that is unnecessary for all-class inference, which is related to this line.
@binliunls: wow, that's a massive sync, apparently caused by waiting for TRT results. Does removing it actually help, though?
Hi @borisfom, the result didn't use TRT. It's just a straightforward MONAI bundle, because on A100 they have basically the same performance and MONAI bundles are easier to run. And yes, removing it helps to improve latency; removing the cudaMalloc alone has already reduced latency by about 200-300 ms. I will try to figure out where these API calls happen in the code and see if we can further improve the performance. Thanks
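For context only, the general pattern behind that kind of fix is to allocate the large output buffer once and reuse it across sliding-window iterations, rather than creating a fresh tensor (and triggering cudaMalloc plus synchronization) inside the loop. This is an illustrative sketch with hypothetical shapes and a hypothetical `window_results` list, not the actual VISTA3D code:

```python
import torch

num_classes, volume_shape = 14, (64, 64, 64)  # hypothetical, small sizes for illustration
device = "cuda" if torch.cuda.is_available() else "cpu"

# Allocate the full-volume logits buffer once, outside the loop, so the loop
# itself never triggers a new large allocation (and the cudaMalloc it implies).
logits = torch.zeros((1, num_classes, *volume_shape), device=device)

# Hypothetical per-window results: (spatial slices, prediction tensor) pairs.
window_results = [
    ((slice(0, 32), slice(0, 32), slice(0, 32)),
     torch.ones((1, num_classes, 32, 32, 32), device=device)),
]

for window_slices, window_pred in window_results:
    # accumulate each window's prediction into a view of the pre-allocated buffer
    logits[(slice(None), slice(None), *window_slices)] += window_pred
```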
The embedding mask part of the classification header is another high-latency part that can be optimized, as shown in the image below. It uses a Python for-loop to perform the tensor multiplication, which is inefficient. The code snippet looks like:

```python
b, c, h, w, d = src.shape
masks = []
for i in range(b):
    mask = class_embedding @ src[[i]].view(1, c, h * w * d)
    masks.append(mask.view(-1, 1, h, w, d))
```

We can refactor it into a single broadcast tensor multiplication:

```python
b, c, h, w, d = src.shape
c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
```

Here is a simple test case to verify that the two implementations are equivalent:

```python
import torch

def mat_mul2(class_embedding, src):
    b, c, h, w, d = src.shape
    c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
    c = c.view(b, -1, h, w, d)
    return torch.transpose(c, 0, 1)

def mat_mul1(class_embedding, src):
    b, c, h, w, d = src.shape
    masks = []
    for i in range(b):
        mask = class_embedding @ src[[i]].view(1, c, h * w * d)
        masks.append(mask.view(-1, 1, h, w, d))
    return torch.cat(masks, 1)

a = torch.rand((17, 1, 4))
b = torch.rand(4, 4, 12, 12, 12)
ans1 = mat_mul1(a, b)
ans2 = mat_mul2(a, b)
assert torch.allclose(ans1, ans2)
```

After replacing the embedding mask calculation with the new implementation, the whole sliding-window inference latency drops from 2.302 s to 1.830 s, and the embedding mask part's latency drops from 278 ms to 4 ms.
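For completeness, a small timing harness (assuming a CUDA device and reusing the two functions and toy shapes from the test above, not the real VISTA3D shapes) could be used to reproduce this kind of comparison:

```python
import time
import torch

def avg_time(fn, class_embedding, src, n_iter=100):
    # warm up, then time with explicit GPU synchronization
    for _ in range(5):
        fn(class_embedding, src)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iter):
        fn(class_embedding, src)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iter

a = torch.rand((17, 1, 4), device="cuda")
b = torch.rand(4, 4, 12, 12, 12, device="cuda")
print("for-loop version: ", avg_time(mat_mul1, a, b))
print("broadcast version:", avg_time(mat_mul2, a, b))
```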
Fixes #8122.

### Description

As shown in [this PR](Project-MONAI/model-zoo#671), the memory malloc and the mask embedding for-loop are the bottlenecks that caused VISTA3D's slow inference. Therefore, this PR fixes them by adding the malloc logic and replacing the for-loop with a tensor multiplication.

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.

Signed-off-by: binliu <[email protected]>
Co-authored-by: Yiheng Wang <[email protected]>
Co-authored-by: YunLiu <[email protected]>
Description
This PR benchmarks, analyzes, and optimizes the VISTA3D bundle's all-class segmentation inference to achieve better latency. I will add all the benchmark results and analyses in the PR comments, while the general conclusions will be kept up to date here in the PR description.
The MONAI core code also needs to be updated according to this PR.
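For reference, the bundle inference can be launched from Python through the MONAI bundle API, which is how a benchmark script might drive it; the paths and the override below are assumptions for illustration, not values taken verbatim from this PR:

```python
from monai.bundle import run

# Hypothetical local paths; the sw_batch_size override corresponds to the
# config key shown in the diff above and is the parameter being benchmarked.
run(
    config_file="/path/to/vista3d/configs/inference.json",
    bundle_root="/path/to/vista3d",
    sw_batch_size=10,
)
```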
Status
Work in progress
Conclusion
TODO: