
Sorted all invocations of alloca #2981

Merged
merged 7 commits into onnx:main on Oct 18, 2024

Conversation

AlexandreEichenberger
Collaborator

alloca was shown to be an issue with large benchmarks, where too much stack allocation ended up crashing the process. alloca is typically used for temporaries needed within the computation of an ONNX operation.

I went through all instances of alloca and left the scalars alone (allocating a few bytes per operation is fine). For the larger data, I migrated it to alloc.

Because alloc results in a more expensive call to allocate and free, when not using parallelism I moved such calls outside the loops. But when using parallelism, the call must be inside the parallel region, as each parallel thread must have its own version of the temporaries; in such cases the alloc stayed inside the parallel loop.

I ensured that buffer hoisting did not move the alloc outside of the scope_alloca region, which appears to also catch alloc operations. I ran individual checks on several key operations, as at this time the verification of parallel operations in our CIs is still very sketchy.

  CheckONNXModel.py -m gemm.mlir -r="-O3 -march=arm64" -a="-parallel"
  CheckONNXModel.py -m reduceMin1.mlir -r="-O3 -march=arm64 -shapeInformation=0:1024x1024x1024" -a="-parallel"
  CheckONNXModel.py -m layernorm4d.mlir -r="-O3 -march=arm64 " -a="-parallel"
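The sequential trade-off described above (hoisting the heap-allocated temporary out of the loop so the allocate/free pair runs once rather than per iteration) can be sketched in plain C. This is a hypothetical illustration, not onnx-mlir code; `scale_rows` and the stand-in computation are invented for the example:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: in the sequential case, the heap-allocated
   temporary is hoisted out of the loop so malloc/free run once, not once
   per row. In a parallel version, each thread would instead allocate its
   own temporary inside the parallel region. */
void scale_rows(const double *in, double *out, size_t rows, size_t cols) {
  double *tmp = malloc(cols * sizeof(double)); /* hoisted: one allocation */
  for (size_t r = 0; r < rows; ++r) {
    for (size_t c = 0; c < cols; ++c)
      tmp[c] = in[r * cols + c] * 2.0;         /* stand-in computation */
    memcpy(out + r * cols, tmp, cols * sizeof(double));
  }
  free(tmp);                                    /* freed once, after the loop */
}
```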

Signed-off-by: Alexandre Eichenberger <[email protected]>
Collaborator

@tungld tungld left a comment


LGTM.

I remember we encountered segfaults because of alloca in the past and added the hoisting pass to move it outside loops. It's a pity to hear that that pass does not work with scope_alloca.

I see there is a pass in MLIR to promote small allocs to alloca; not sure whether it works or not.

@@ -215,6 +217,7 @@ ZTensor ZTensorHelper::getZTensor(Value bufferPtr, zdnn_data_types dataType,
Value transformedDescPtr =
getTransformedDescPtr(preTransformedDescPtr, isConcat, concatInfo);
// Create the input zTensor.
// TODO: evaluate if a heap alloc would not be better.
Collaborator


I often see MLIR use alloca to create an LLVM struct.

Collaborator Author


Makes sense; it's still small. Want me to remove the comment? @tungld

Collaborator


Yes, it's ok to remove it. Thanks!

Collaborator Author

@AlexandreEichenberger AlexandreEichenberger Oct 18, 2024


Changed it, with an explanation of why it is good to have alloca here:

// eventually all the refs to alloca to be register/spill access, not memory
// load/stores.
Value TmpProd = create.mem.alignedAlloca(CTmpType, BUFFER_ALIGN);
Value TmpProd = create.mem.alignedAlloc(CTmpType, BUFFER_ALIGN);
Collaborator


If it is good practice to use alignedAlloca for scalar types, should we add a verification in create.mem.alignedAlloca that checks that the type is scalar? That way, we can avoid the situation where users call create.mem.alignedAlloca for a non-scalar type.

Collaborator Author


Left for another PR.

@AlexandreEichenberger
Collaborator Author

I remember we encountered segfaults because of alloca in the past and added the hoisting pass to move it outside loops. It's a pity to hear that that pass does not work with scope_alloca.

scope_alloca stops the migration of alloc and alloca out of that scope. This is important for parallel regions, as having the allocation stay in the parallel region means that each thread gets its own copy. So that works fine.

Now I also manually moved the non-parallel case out of the loops, just to make sure that is fine. Would you like me to check if it's really needed? That way we could always have the call inside: if it is parallel, it is blocked there, and otherwise it is not.

In one case, I really needed to move it outside manually, because it was inside an inner-loop if-then-else. By placing it outside of the if-then-else (allocating in all cases), it is able to migrate to the outside.
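The if-then-else point can be sketched in plain C. This is a hypothetical illustration, not onnx-mlir code; `sum_with_hoisted_tmp`, its inputs, and the stand-in computation are invented for the example:

```c
#include <stdlib.h>

/* Hypothetical sketch: if a temporary is only allocated inside one branch
   of an if-then-else that sits in an inner loop, the allocation cannot be
   hoisted. Allocating it unconditionally before the branch (paying for it
   even on the path that does not need it) lets the allocation move
   outside the loop. */
double sum_with_hoisted_tmp(const double *in, size_t n, size_t chunk) {
  /* Allocated once, before the loop and the branch, in all cases. */
  double *tmp = malloc(chunk * sizeof(double));
  double total = 0.0;
  for (size_t i = 0; i + chunk <= n; i += chunk) {
    if (in[i] >= 0.0) { /* only this path actually uses the temporary */
      for (size_t j = 0; j < chunk; ++j)
        tmp[j] = in[i + j] * in[i + j];
      for (size_t j = 0; j < chunk; ++j)
        total += tmp[j];
    } else {
      total += in[i]; /* branch that never touches tmp */
    }
  }
  free(tmp);
  return total;
}
```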

@AlexandreEichenberger
Collaborator Author

Responded to comments, and removed the placement of most "alloc" before/inside the main loop depending on parallel. Placing it inside works in both cases: if it's parallel, it remains inside because of the inserted scope_alloca; if it's sequential, then there is no scope and it naturally migrates outside the main loop.

@tungld
Collaborator

tungld commented Oct 18, 2024

If it's parallel, it remains inside because of the inserted scope_alloca; if it's sequential, then there is no scope and it naturally migrates outside the main loop.

Great, thanks! That was what I expected.

@AlexandreEichenberger AlexandreEichenberger merged commit 1435011 into onnx:main Oct 18, 2024
6 of 7 checks passed
@jenkins-droid
Collaborator

Jenkins Linux amd64 Build #15879 [push] Sorted all invocations o... started at 22:37

@jenkins-droid
Collaborator

Jenkins Linux s390x Build #15882 [push] Sorted all invocations o... started at 23:37

@jenkins-droid
Collaborator

Jenkins Linux ppc64le Build #14909 [push] Sorted all invocations o... started at 23:50

@jenkins-droid
Collaborator

Jenkins Linux amd64 Build #15879 [push] Sorted all invocations o... passed after 1 hr 14 min

@jenkins-droid
Collaborator

Jenkins Linux s390x Build #15882 [push] Sorted all invocations o... passed after 1 hr 34 min

@jenkins-droid
Collaborator

Jenkins Linux ppc64le Build #14909 [push] Sorted all invocations o... passed after 2 hr 21 min
