-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sorted all invocations of alloca #2981
Sorted all invocations of alloca #2981
Conversation
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I remember we encounted segfaults because of alloca
in the past and added the hoisting pass to move it outside loops. It's a pity to hear that that pass does not work with scope_alloca
.
I see this pass in MLIR to promote small alloc to alloca, not sure if it works or not.
@@ -215,6 +217,7 @@ ZTensor ZTensorHelper::getZTensor(Value bufferPtr, zdnn_data_types dataType, | |||
Value transformedDescPtr = | |||
getTransformedDescPtr(preTransformedDescPtr, isConcat, concatInfo); | |||
// Create the input zTensor. | |||
// TODO: evaluate if a heap alloc would not be better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I often see in MLIR they use alloca to create a LLVM struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, it's still small. Want me to remove the comment? @tungld
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's ok to remove it. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it with an explanation of why it is good to have alloca here
// eventually all the refs to alloca to be register/spill access, not memory | ||
// load/stores. | ||
Value TmpProd = create.mem.alignedAlloca(CTmpType, BUFFER_ALIGN); | ||
Value TmpProd = create.mem.alignedAlloc(CTmpType, BUFFER_ALIGN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is a good practice to use alignedAlloca
for scalar type, should we add a verification in create.mem.alignedAlloca
that checks type would be scalar. By that way, we can avoid the situation where users call create.mem.alignedAlloca
for non-scalar type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left for another PR.
Now I also manually moved the case with no parallel out of the loops, just to make sure that is fine. Would you like me to check if it's really needed? That way we could have the call always inside, and if parallel is there, its blocked and otherwise it is not? In one case, I really need it outside manually, because it was inside of an inner-loop if-then-else. So by having it outside of the if-then-else (allocating in all case), it allows it to migrate to the outside. |
Signed-off-by: Alexandre Eichenberger <[email protected]>
Responded to comments, and removed the placement of most "alloc" before/inside the main loop depending on parallel. Placing it inside works in both cases. If its parallel, it remains inside because of the inserted |
Great, thanks! It was what I expected. |
Jenkins Linux amd64 Build #15879 [push] Sorted all invocations o... started at 22:37 |
Jenkins Linux s390x Build #15882 [push] Sorted all invocations o... started at 23:37 |
Jenkins Linux ppc64le Build #14909 [push] Sorted all invocations o... started at 23:50 |
Jenkins Linux amd64 Build #15879 [push] Sorted all invocations o... passed after 1 hr 14 min |
Jenkins Linux s390x Build #15882 [push] Sorted all invocations o... passed after 1 hr 34 min |
Jenkins Linux ppc64le Build #14909 [push] Sorted all invocations o... passed after 2 hr 21 min |
Alloca was shown to be an issue with large benchmarks, where too much stack alloc ended up crashing the process. Alloca are typically used for temporaries needed within the computations of an onnx operation.
Went through all instances of
alloca
, and left alone the scalars (allocating a few bytes per operations is fine). For the larger data, I migrated it toalloc
.Because
alloc
results in a more expensive call to allocate and free, when not using parallelism, I moved such calls outside the loops. But when using parallelism, such call must be inside the parallel region as each parallel thread must have its version of temporaries, so in such cases thealloc
stayed inside the parallel loop.I ensured that the buffer hoisting did not bring the alloc outside of the
scope_alloca
, which seems to also catch thealloc
operations. I ran individual checks on several key operations, as at this time the verifying to parallel operations in our CIs are still very sketchy.