reduce Uniform/Independent-Elements iterator register footprint #2383

Merged


psychocoderHPC
Member

@psychocoderHPC psychocoderHPC commented Sep 16, 2024

reduce register footprint

fix #2382

Rewrite Uniform/Independent-Elements iterator to reduce the register footprint.

  • avoid multiple returns within a function
  • reduce the iterator state size by one element

Change bufferCopy example, use the element layer for CPU accelerators.

Before the bufferCopy example used UniformElements, the original code required 38 and 56 registers.
With this PR we require 64 and 71 registers, but this cannot be compared to the old develop branch because the old implementation ignores the element layer.
Compared to the current development branch, where 80 and 85 registers are required, this refactoring is a huge improvement and will increase occupancy in real-world usage.
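
To put the register counts above in terms of occupancy, here is a rough sketch of the arithmetic. It assumes the commonly documented limits for compute capability 5.2 (65536 registers and 2048 resident threads per multiprocessor) and deliberately ignores register allocation granularity, shared memory and block-size limits, so it only gives an upper bound. The function name `maxResidentWarps` is made up for this illustration and is not part of alpaka.

```cpp
#include <algorithm>
#include <cassert>

// Assumed hardware limits for compute capability 5.2 (the 'sm_52' target in the ptxas output).
const int kRegistersPerSm = 65536; // 32-bit registers per multiprocessor
const int kMaxThreadsPerSm = 2048; // resident-thread limit per multiprocessor
const int kWarpSize = 32;

// Upper bound on resident warps per multiprocessor when registers are the only limit.
// Hypothetical helper: real occupancy is lower once allocation granularity,
// shared memory and block size are taken into account.
inline int maxResidentWarps(int regsPerThread)
{
    assert(regsPerThread > 0);
    const int threadsByRegisters = kRegistersPerSm / regsPerThread;
    const int threads = std::min(threadsByRegisters, kMaxThreadsPerSm);
    return threads / kWarpSize; // only whole warps can be resident
}
```

Under these assumptions the bound is 25 and 24 warps for 80 and 85 registers, versus 32 and 28 warps for 64 and 71 registers.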

Incrementing the iterator now always costs two integer increments and two comparisons. On GPU devices the comparisons are compiled into compare-and-select instructions, so there is no branching overhead.
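
As an illustration of the kind of increment described above, here is a minimal sketch with a single return path, two increments and two comparisons. It is not the code from this PR: the type `StridedIter`, its member names and the helper `collect` are assumptions made up for this example, and the real iterator state in `UniformElements.hpp` may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Idx = std::size_t; // hypothetical index type

// Hypothetical iterator over groups of `elements` consecutive indices,
// where the groups of one thread are `stride` apart.
struct StridedIter
{
    Idx elements; // indices per group
    Idx stride;   // distance between the starts of two consecutive groups
    Idx extent;   // one past the last valid index, also used as the end marker
    Idx index;    // current global index
    Idx local;    // position inside the current group, in [0, elements)

    StridedIter& operator++()
    {
        ++index;
        ++local;
        // Comparison 1: did we step past the end of the current group?
        const bool groupDone = (local == elements);
        // Selects instead of early returns: jump to the next group if needed.
        index = groupDone ? index - elements + stride : index;
        local = groupDone ? Idx{0} : local;
        // Comparison 2: clamp to the end marker.
        index = (index < extent) ? index : extent;
        return *this;
    }
};

// Collect all indices visited when starting at `first`.
inline std::vector<Idx> collect(Idx first, Idx elements, Idx stride, Idx extent)
{
    assert(elements > 0 && stride >= elements);
    StridedIter it{elements, stride, extent, std::min(first, extent), Idx{0}};
    std::vector<Idx> visited;
    while(it.index != extent)
    {
        visited.push_back(it.index);
        ++it;
    }
    return visited;
}
```

With 2 elements per group, a stride of 8 and an extent of 13, starting at index 4 visits 4, 5 and 12. Whether a compiler actually emits select instructions for the ternaries depends on the target and optimization level.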

ptxas info    : 219055 bytes gmem, 72 bytes cmem[3]
ptxas info    : Compiling entry function '_ZN6alpaka6detail9gpuKernelI17PrintBufferKernelNS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS3_St17integral_constantImLm3EEmEES6_mJNSt12experimental6mdspanIjNS8_7extentsImJLm18446744073709551615ELm18446744073709551615ELm18446744073709551615EEEENS8_13layout_strideENS_12experimental6traits6detail19ByteIndexedAccessorIjEEEEEEEvNS_3VecIT2_T3_EET_DpT4_' for 'sm_52'
ptxas info    : Function properties for _ZN6alpaka6detail9gpuKernelI17PrintBufferKernelNS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS3_St17integral_constantImLm3EEmEES6_mJNSt12experimental6mdspanIjNS8_7extentsImJLm18446744073709551615ELm18446744073709551615ELm18446744073709551615EEEENS8_13layout_strideENS_12experimental6traits6detail19ByteIndexedAccessorIjEEEEEEEvNS_3VecIT2_T3_EET_DpT4_
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 408 bytes cmem[0]
ptxas info    : Compiling entry function '_ZN6alpaka6detail9gpuKernelI16TestBufferKernelNS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS3_St17integral_constantImLm3EEmEES6_mJNSt12experimental6mdspanIjNS8_7extentsImJLm18446744073709551615ELm18446744073709551615ELm18446744073709551615EEEENS8_13layout_strideENS_12experimental6traits6detail19ByteIndexedAccessorIjEEEEEEEvNS_3VecIT2_T3_EET_DpT4_' for 'sm_52'
ptxas info    : Function properties for _ZN6alpaka6detail9gpuKernelI16TestBufferKernelNS_9ApiCudaRtENS_22AccGpuUniformCudaHipRtIS3_St17integral_constantImLm3EEmEES6_mJNSt12experimental6mdspanIjNS8_7extentsImJLm18446744073709551615ELm18446744073709551615ELm18446744073709551615EEEENS8_13layout_strideENS_12experimental6traits6detail19ByteIndexedAccessorIjEEEEEEEvNS_3VecIT2_T3_EET_DpT4_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 71 registers, 408 bytes cmem[0]

@psychocoderHPC psychocoderHPC added this to the 1.2.0 milestone Sep 16, 2024
@psychocoderHPC psychocoderHPC changed the title reduce UniformElements iterator register footprint reduce Uniform/Independent-Elements iterator register footprint Sep 16, 2024
@psychocoderHPC
Member Author

IndependentElements iterator register footprint in the test goes down from 28 registers to 22.

@psychocoderHPC
Member Author

@fwyzard the implementation for const_iterator in IndependentElements and UniformElements is 100% identical, should we in a follow up PR provide the implementation once and avoid code duplication?

fix alpaka-group#2382

Rewrite UniformElements iterator to reduce the register footprint.

- avoid multiple return within a function
- reduce the iterator state size by one element
@psychocoderHPC
Member Author

@fwyzard CI passed, ready for review

Contributor

@fwyzard fwyzard left a comment

The logic of UniformElements looks correct.

Does the code work if you use 0 instead of Idx{0}, or does the compiler complain needlessly?
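
For context on the question above, here is a small generic C++ sketch of where a bare `0` is accepted and where a typed zero is needed. It is not taken from the PR; `Idx` is assumed to be `long` here, and the function names are invented for the example.

```cpp
#include <algorithm>

using Idx = long; // hypothetical index type; alpaka lets the user choose it

// A bare 0 works in a conditional expression: the int literal is
// converted to the common type of both operands.
inline Idx resetIfDone(bool done, Idx value)
{
    return done ? 0 : value;
}

// Function templates that deduce one type from both arguments need a typed zero.
// std::max(value, 0) would fail to compile here because T would be deduced
// as both long and int.
inline Idx nonNegative(Idx value)
{
    return std::max(value, Idx{0});
}
```

So whether `0` can replace `Idx{0}` depends on the expression it appears in.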

@fwyzard
Contributor

fwyzard commented Sep 19, 2024

What do you think would be the impact on the register usage and performance if the iterator kept a pointer to the UniformElementsAlong object instead of storing elements_, stride_ and extent_ ?

@fwyzard
Contributor

fwyzard commented Sep 19, 2024

> @fwyzard the implementation for const_iterator in IndependentElements and UniformElements is 100% identical, should we in a follow up PR provide the implementation once and avoid code duplication?

Sure.

Currently they need to be friends of their respective classes, but maybe the class could be passed as a template argument?

@psychocoderHPC
Member Author

> What do you think would be the impact on the register usage and performance if the iterator kept a pointer to the UniformElementsAlong object instead of storing elements_, stride_ and extent_ ?

This will most likely not help and in case you create the iterator within a function it is very easy to produce illegal memory access because the main object is already destroyed.
Nevertheless, copying a constant value is not the problem; the compiler optimizes it out. I also experimented with keeping the ACC in the iterator as a const reference (this is safe from the alpaka side) and querying the number of elements each time, because they are stored in the accelerator. This does not decrease the memory footprint.

Apply the same optimizations applied to UniformElements.
@fwyzard
Contributor

fwyzard commented Sep 19, 2024

> This will most likely not help

It doesn't.

> and in case you create the iterator within a function it is very easy to produce illegal memory access because the main object is already destroyed.

Well, using an iterator after the container it iterates over has been destroyed is definitely a programming error - the same applies to all containers, like a vector, etc.

UniformElementsND already uses this approach to avoid copying a large number of dimensions.
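
The two designs discussed in this thread can be sketched roughly as follows. This is a simplified illustration, not the alpaka implementation: `Range`, `IterByValue`, `IterByPointer` and the helper functions are hypothetical names, and only the lifetime aspect is shown.

```cpp
#include <cassert>
#include <cstddef>

using Idx = std::size_t; // hypothetical index type

// Stand-in for the range object that hands out iterators.
struct Range
{
    Idx elements;
    Idx stride;
    Idx extent;
};

// Design 1: the iterator copies the values it needs.
struct IterByValue
{
    Idx elements;
    Idx stride;
    Idx extent;
    Idx index;
};

// Design 2: the iterator only keeps a pointer to the range.
// It must not outlive the range it points to.
struct IterByPointer
{
    Range const* range;
    Idx index;
};

inline IterByValue beginByValue(Range const& r, Idx first)
{
    return IterByValue{r.elements, r.stride, r.extent, first};
}

inline IterByPointer beginByPointer(Range const& r, Idx first)
{
    return IterByPointer{&r, first};
}

// Safe with design 1: the iterator carries its own copies, so it stays
// valid after `local` is destroyed. Returning beginByPointer(local, 4)
// here instead would hand out a dangling pointer.
inline IterByValue makeDetachedIterator()
{
    Range const local{2, 8, 13};
    assert(local.stride >= local.elements);
    return beginByValue(local, 4);
}
```

The pointer variant has less state, but as noted above it did not reduce register usage in this case, and it carries the usual rule that the range must stay alive while it is iterated.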

@psychocoderHPC
Member Author

ready to be merged

@fwyzard fwyzard merged commit 11ab218 into alpaka-group:develop Sep 20, 2024
22 checks passed
@psychocoderHPC psychocoderHPC deleted the topic-iteratorRegisterFootprint branch September 20, 2024 08:41
Successfully merging this pull request may close these issues.

reduce register footprint for the new iterators