reduce Uniform/Independent-Elements iterator register footprint #2383
The register footprint of the IndependentElements iterator in the test goes down from 28 to 22 registers.
@fwyzard the implementation for |
fix alpaka-group#2382
Rewrite UniformElements iterator to reduce the register footprint.
- avoid multiple return statements within a function
- reduce the iterator state size by one element
@fwyzard CI passed, ready for review
The logic of UniformElements looks correct.
Does the code work if you use 0 instead of Idx{0}, or does the compiler uselessly complain?
What do you think would be the impact on the register usage and performance if the iterator kept a pointer to the |
Sure. Currently they need to be |
This will most likely not help, and if you create the iterator within a function it is very easy to produce an illegal memory access because the main object has already been destroyed.
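To make the trade-off discussed here concrete, the following is a minimal sketch of the two designs: an iterator that copies the values it needs, and one that keeps a pointer to its parent object. All names (`Range`, `ValueIterator`, `PointerIterator`, `makeValueIterator`) are hypothetical and are not taken from alpaka; the sketch only illustrates the lifetime concern, not the register usage.

```cpp
#include <cstddef>

using Idx = std::size_t;

// Hypothetical parent object describing the index range [first, extent).
struct Range
{
    Idx first;
    Idx extent;
};

// Variant A: the iterator copies everything it needs from the parent.
struct ValueIterator
{
    Idx index;
    Idx extent;

    explicit ValueIterator(Range const& r) : index{r.first}, extent{r.extent} {}
    bool done() const { return index >= extent; }
    Idx operator*() const { return index; }
    ValueIterator& operator++() { ++index; return *this; }
};

// Variant B: the iterator keeps a pointer to the parent (assumed design).
struct PointerIterator
{
    Range const* parent;
    Idx index;

    explicit PointerIterator(Range const& r) : parent{&r}, index{r.first} {}
    bool done() const { return index >= parent->extent; } // reads through the pointer
    Idx operator*() const { return index; }
    PointerIterator& operator++() { ++index; return *this; }
};

// Safe: the returned iterator holds copies, so it may outlive `local`.
ValueIterator makeValueIterator(Idx first, Idx extent)
{
    Range const local{first, extent};
    return ValueIterator{local};
}

// Returning PointerIterator{local} from a function like the one above would
// hand out a dangling pointer; that variant is intentionally not written here.

Idx sumWithValueIterator(Idx first, Idx extent)
{
    Idx sum = 0;
    for(auto it = makeValueIterator(first, extent); !it.done(); ++it)
        sum += *it;
    return sum;
}

// Valid use of variant B: the parent stays alive for the whole loop.
Idx sumWithPointerIterator(Idx first, Idx extent)
{
    Range const range{first, extent};
    Idx sum = 0;
    for(PointerIterator it{range}; !it.done(); ++it)
        sum += *it;
    return sum;
}
```

Variant B is only correct while the parent object is alive, which is why creating such an iterator inside a helper function and returning it is easy to get wrong.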
Apply the same optimizations applied to UniformElements.
It doesn't.
Well, using an iterator after the container it iterates over has been destroyed is definitely a programming error - the same applies to all containers, like a
|
ready to be merged
reduce register footprint
fix #2382
Rewrite Uniform/Independent-Elements iterator to reduce the register footprint.
Change bufferCopy example, use the element layer for CPU accelerators.
The original code, before the bufferCopy example used UniformElements, required 38 and 56 registers.
With this PR we require 64 and 71 registers, but this cannot be compared to the old develop branch because the old implementation ignored the element layer.
Compared to the current development branch, where 80 and 85 registers are required, this refactoring is a huge improvement and will increase the occupancy in real-world usage.
Incrementing the iterator now always costs two integral increments and two comparisons. On GPU devices the comparisons are transformed into compare-and-select instructions, so there is no branching overhead.
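The increment pattern described above can be sketched as follows. This is an illustrative sketch, not the code from this PR: the type `ChunkedIndexIterator`, its members, and the helper `visitedIndices` are hypothetical, and it assumes `stride >= elements > 0`. It shows an increment with a single return path, two integral updates, and two comparisons whose results feed ternary expressions that a compiler can typically lower to select instructions.

```cpp
#include <cstddef>
#include <vector>

using Idx = std::size_t;

// Hypothetical iterator: visits `elements` consecutive indices, then jumps
// by `stride` to the next chunk, until `extent` is reached.
struct ChunkedIndexIterator
{
    Idx elements;   // consecutive indices per chunk
    Idx stride;     // distance between the starts of two chunks
    Idx extent;     // one past the last valid index
    Idx chunkStart; // first index of the current chunk
    Idx offset;     // position inside the current chunk

    ChunkedIndexIterator(Idx first, Idx elements_, Idx stride_, Idx extent_)
        : elements{elements_}
        , stride{stride_}
        , extent{extent_}
        , chunkStart{first < extent_ ? first : extent_} // clamp to the end state
        , offset{0}
    {
    }

    Idx operator*() const { return chunkStart + offset; }
    bool done() const { return chunkStart >= extent; }

    ChunkedIndexIterator& operator++()
    {
        ++offset;                                    // first increment
        bool const chunkDone = offset >= elements;   // first comparison
        offset = chunkDone ? Idx{0} : offset;        // select, no early return
        chunkStart += chunkDone ? stride : Idx{0};   // second increment
        bool const finished = chunkStart + offset >= extent; // second comparison
        chunkStart = finished ? extent : chunkStart; // select: move to end state
        offset = finished ? Idx{0} : offset;
        return *this;                                // single return path
    }
};

// Collects every index the iterator visits, for illustration and testing.
std::vector<Idx> visitedIndices(Idx first, Idx elements, Idx stride, Idx extent)
{
    std::vector<Idx> out;
    for(ChunkedIndexIterator it{first, elements, stride, extent}; !it.done(); ++it)
        out.push_back(*it);
    return out;
}
```

For example, `visitedIndices(0, 4, 8, 10)` visits 0, 1, 2, 3, 8, 9: the first chunk is complete and the second one is cut off at the extent. Whether the ternaries actually become select instructions depends on the compiler and target.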