Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize CPU and Memory performance for Resize linear mode parser #3731

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

coxuamd
Copy link

@coxuamd coxuamd commented Dec 22, 2024

Re-write calc_neighbor_points() by composing index from binary bits instead of recursion.

With the optimized calc_neighbor_points(), CPU time required by 90% and peak memory utilization is significantly reduced.

Perf. comparision on VM w/ 12-Core EPYC 9V64 + 128 GB mem:

n_dim out_elements New t-CPU (us) Old t-CPU (us) t-CPU Ratio
4 786,432 170,377 1,878,299 0.0907
4 1,572,864 383,125 4,009,335 0.0956
4 3,145,728 784,388 7,670,960 0.1023
4 6,291,456 1,567,753 15,095,017 0.1039
4 12,582,912 3,139,452 29,622,921 0.1060
4 25,165,824 6,266,153 58,332,233 0.1074
4 50,331,648 12,517,674 116,923,368 0.1071
4 100,663,296 25,011,425 OOM Kill N/A

Copy link
Collaborator

@TedThemistokleous TedThemistokleous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

I have left comments for the tidy errors and concerns I have with the changes.

Please also run and ensure all our testcases are functional via building migraphx with make check when you build.

Some of the Onnx verify tests are failing with your changes for resize. These need to be working to ensure no lose of functionality between old and new methods.

[==========] 286 tests ran
[  FAILED  ] 2 tests failed
[  FAILED  ] resize_upsample_linear_ac_test
[  FAILED  ] resize_upsample_linear_test

The two files that seem to break are found at

test/onnx/verify/resize_upsample_linear_test.cpp
test/onnx/verify/resize_upsample_linear_ac_test.cpp

Also please ensure your changes meet format as outlined from here

https://github.com/ROCm/AMDMIGraphX/actions/runs/12454557312/job/34765947690?pr=3731

@TedThemistokleous TedThemistokleous added the high priority A PR with high priority for review and merging. label Dec 22, 2024
@lakhinderwalia
Copy link
Contributor

@TedThemistokleous, Why is this PR showing up on rocm:develop, while others do on develop, please?

@TedThemistokleous
Copy link
Collaborator

@TedThemistokleous, Why is this PR showing up on rocm:develop, while others do on develop, please?

External contributor I think? I believe its should be the same repo the commit is going to.

Copy link

codecov bot commented Dec 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.16%. Comparing base (6acc1f9) to head (006eae7).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3731      +/-   ##
===========================================
- Coverage    92.16%   92.16%   -0.01%     
===========================================
  Files          515      515              
  Lines        21978    21975       -3     
===========================================
- Hits         20256    20253       -3     
  Misses        1722     1722              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Re-write calc_neighbor_points() by composing index from binary bits
instead of recursion.

With the optimized calc_neighbor_points(), CPU time required by 90% and
peak memory utilization is significantly reduced.

Perf. comparision on VM w/ 12-Core EPYC 9V64 + 128 GB mem:
  n_dim   out_elements   New t-CPU (us)   Old t-CPU (us)   t-CPU Ratio
 ------- -------------- ---------------- ---------------- -------------
      4        786,432          170,377        1,878,299        0.0907
      4      1,572,864          383,125        4,009,335        0.0956
      4      3,145,728          784,388        7,670,960        0.1023
      4      6,291,456        1,567,753       15,095,017        0.1039
      4     12,582,912        3,139,452       29,622,921        0.1060
      4     25,165,824        6,266,153       58,332,233        0.1074
      4     50,331,648       12,517,674      116,923,368        0.1071
      4    100,663,296       25,011,425         OOM Kill           N/A

Signed-off-by: Colin Xu <[email protected]>
Copy link
Contributor

@lakhinderwalia lakhinderwalia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking a stab at cleaning this function. However, this PR needs two additional things (besides all test-cases to pass):

  • Code comments for calc_neighbor_points. And this is not the fault of this PR, but the comments were/are entirely missing, yet this function is very complicated for a reviewer's understanding, and I am not sure what exactly is it doing. And that documentation needs to be fixed now, since this function is being rewritten.

  • A unit-test to specifically test this extremely complex function. It should have been there much earlier, but it can be added now in this PR.

int i_dim,
std::vector<std::vector<std::size_t>> vec_dims,
const shape& in_s)
static void calc_neighbor_points(const std::vector<std::vector<std::vector<std::size_t>>>& vvv_ind,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This function doesn't have to take vec_ind as an input parameter. It can simply be a return value, as before -- just define it at the beginning of this function.
  • built-ins like n_dims and out_elements should not be const references. Just use the value please.
  • Perhaps n_dims can be deduce here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying just get the n_dims and out_elements from:
vvv_ind.size() and vvv_ind[0][0].size()?

dim.push_back(i);
return dim;
});
std::vector<std::size_t> idx(n_dims);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value in defining idx it of size n_dims, when it gets cleared in the for-loop below?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On defining, idx is empty.

  • In each out_elements loop, idx is cleared.
  • In each n_dims loop, using one bit to pick elements from vvv_ind() last dimension, and compose a new vector.
  • Use the new vector to get index in graph.
  • Restart next out_elements loop.

idx.clear();
for(std::size_t dim = 0; dim < n_dims; dim++)
{
idx.push_back(static_cast<bool>(bi & std::size_t{1}) ? vvv_ind[dim][1][idx_e]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idx.push_back(vvv_ind[dim][bi &1][idx_e]);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take this one as it's much cleaner.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still needs the size_t{1} here too I believe

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, still need keep size_t{1} for unsigned

{
if(i_dim == vvv_ind.size())
for(std::size_t start = 0; start < (std::size_t{1} << n_dims); start++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start < n_dims * 2. I don't know the logic here, but the loop can be simplified a bit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we need 2 ^ n_dims outside loop so have to use (std::size_t{1} << n_dims).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're relying on bitshifts I suppose to save cycles then? 1 cycle vs the more for mul/divides right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. And semantically the loop is to take each bit so shift left to get the range, which is power of 2, shift right to extract each bit. Although consequentially shift or mul/div is equal and compile should able to optimize to most fast one, however keep it shift left/right makes the entire logic more clearly.

{
std::size_t bi = start;
idx.clear();
for(std::size_t dim = 0; dim < n_dims; dim++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two for-loops in the prior code, and they seem to fill in a certain way -- which might be more complex than what is now done in this single for-loop. I am not clear on this code, but I would like to see some test cases being added with this PR, and that would help me see it better.

{
idx.push_back(static_cast<bool>(bi & std::size_t{1}) ? vvv_ind[dim][1][idx_e]
: vvv_ind[dim][0][idx_e]);
bi = bi >> std::size_t{1};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bi = bi / 2; or bi >>=1;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revise to bi >>=1; But suggest keep >> as shift is much faster than divide and we only need divide by 2 here.

@@ -375,8 +354,9 @@ struct parse_resize : op_parser<parse_resize>
}
});

auto ind = calc_neighbor_points(
vvv_ind, 0, std::vector<std::vector<std::size_t>>(out_elements), in_s);
std::vector<int> ind;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(as mentioned above, let ind need not be an input_parameter)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revise to the original way by returning a vector directly.

@coxuamd
Copy link
Author

coxuamd commented Dec 24, 2024

Thank you for taking a stab at cleaning this function. However, this PR needs two additional things (besides all test-cases to pass):

  • Code comments for calc_neighbor_points. And this is not the fault of this PR, but the comments were/are entirely missing, yet this function is very complicated for a reviewer's understanding, and I am not sure what exactly is it doing. And that documentation needs to be fixed now, since this function is being rewritten.
  • A unit-test to specifically test this extremely complex function. It should have been there much earlier, but it can be added now in this PR.

Here's the explanation of the new algorithm:
Given a std::vector<std::vector<std::vectorstd::size_t>> vvv_ind() in below layout:
vvv_ind{} = {
{{}, {}},
{{}, {}},
{{}, {}},
{{}, {}},
}

1-dim is n_dims, example n_dims = 4;
2-dim is hard-coded to 2, in the original algorithm, ref. as lo/hi
3-dim is a vector of integer with out_elements.

What the original calc_neighbor_points() algorithm is trying to do, is to compose a new vector, with size (2^n_dims * out_elements), within each is a n_dim vector of integer. In current example, if out_elements=16, n_dims=4, then we'll have a new vector as below:

vec_ind{} = {
{}, # Each element is a vector of 4 integer
{},
{},
..
{}, # 2^4 * 16 vectors in total
}

Let's re-write vvv_ind in different pattern for friendly understanding, where each character is a vector of 16 elements, like A = {0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1}.

AB
CD
EF
GH

The original calc_neighbor_points() is done in a recursive way that append elements in vertical, and expand in horizontal. Each 16-element vector will be transposed, notated as A':
A' = {
{0},
{1},
{1},
{1},
{0},
{1},
{1},
{1},
{0},
{1},
{1},
{1},
{0},
{1},
{1},
{1},
}

Recursion:
Pass 0:
A'
B'

Pass 1:
A'C'
B'C'
A'D'
B'D'

Pass 2:
A'C'E'
B'C'E'
A'D'E'
B'D'E'
A'C'F'
B'C'F'
A'D'F'
B'D'F'

Pass 3:
A'C'E'G'
B'C'E'G'
A'D'E'G'
B'D'E'G'
...
B'D'F'G'

Pass 4, last time, with the crafted vector, 2^4 * 16 elements, each element is a vector of 4 integer as the index, to get the index from shape in_s.index(idx)

Since the 2nd dimension is hardcoded to 2, we can treat this dimension as binary. What the final result (before in_s.index(idx)) is actaully we increase from 0 to 2^n_dim, and convert this n-dim bits value to binary, using the bit to index into 2nd-dim of vvv_ind (hi or low), use the position in n-dim to index into 1st-dim of vvv_ind, and loop out_elements so that all elements in A (and other capital character) can be looped.

Taking Pass 3 (before the final in_s.index(idx)) to explain:

vec_ind{}
0-1
AB
CD
EF
GH

Pass 3:
outer loop on 2^n_dims w/ start:
A'C'E'G' >> 0000b >> start(0) convert to 4-bit binary, using each bit to index to vec_ind
-- middle loop on out_elements, so that we process all out_elements(16) in same way.
---- reset bits shift from start, so on each element, shift can cover all bits.
------ inner loop: Pick each bit to index to vec_ind
B'C'E'G' >> 1000b >> start(1), ditto
A'D'E'G' >> 0100b >> start(2), ditto
B'D'E'G' >> 1100b >> start(3), ditto
...
B'D'F'G' >> 1111b

return dim;
});
std::vector<std::size_t> idx(n_dims);
for(std::size_t idx_e = 0; idx_e < out_elements; idx_e++)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename idx_e to something else as this makes it harder to follow when you have idx in the above loop and idx used right after with idx_e as one of the indicies in the access. At glance its easy to miss because of the similarity.

Something like elements_idx would be clearer

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Yes let me rename the variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
high priority A PR with high priority for review and merging. Perf Improve
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants