Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some optimisations for reducing number of memory allocations and improving BVH build speed. #1319

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Ono-Sendai
Copy link
Contributor

Hi Jorrit,
MeshShape building is a bit of a bottleneck in Substrata, since as the 3d models consist of user-generated content, I can't really build them ahead of time.
As such I wanted to reduce the number of memory allocations Jolt does, and also speed up BVH building a bit. The large number of memory allocations Jolt was doing was impacting multi-threaded computation quite a lot due to global allocator contention.

Results building a MeshShape with 152350 triangles:

Results before these optimisations are applied
---------------------------------------------------
createJoltShapeForBatchedMesh took 0.1375 s, num_allocs: 463825, min time so far: 0.1293 s

Results after optimisations are applied:
----------------------------------------------
createJoltShapeForBatchedMesh took 0.0872 s, num_allocs: 64, min time so far: 85.4597 ms

As you can see the number of allocations is decreased rather a lot :)

The code is not throughly tested, I mostly banged it out this evening, apart from using my existing HashMap and HashSet code. I'm not wedded to the HashMap and HashSet implementation, we just need something better than the standard library trash.

…ad allocate nodes and leaf triangles from single Arrays.
…hich uses linear probing.

Unlike std::unordered_map, it doesn't make an allocation for each insertion, however iterators are invalidated upon item insertion.
Also requires passing an 'empty key' argument to the constructor.
Is generally faster than std::unordered_map due to the user of open addressing and not doing so many allocations.
…to bins on all 3 dimensions, as each triangle is processed.

Not thoroughly tested for correctness.

Results, for a mesh with 152350 tris:

Before optimisations
---------------------
createJoltShapeForBatchedMesh took 0.1297 s, num_allocs: 64, min time so far: 109.7628 ms

With optimisations
-------------------
createJoltShapeForBatchedMesh took 0.0872 s, num_allocs: 64, min time so far: 85.4597 ms
@CLAassistant
Copy link

CLAassistant commented Nov 4, 2024

CLA assistant check
All committers have signed the CLA.

@jrouwe
Copy link
Owner

jrouwe commented Nov 4, 2024

Thanks! I'm going to need some time some time to process/test this.

@Ono-Sendai
Copy link
Contributor Author

Ono-Sendai commented Nov 5, 2024

Here's a before optimisations vs. after optimisations profile trace: (from Tracy profiler)

The sloped lines in the memory usage graph in the 'before' trace are Jolt (in particular std::unordered_map / std::unordered_set) doing lots of little allocations :)

without jolt optimisations

with jolt optimisations

@jrouwe
Copy link
Owner

jrouwe commented Nov 9, 2024

FYI: I'm going to integrate this in stages. First the changes to AABBTree, then I'll look at the bins and finally at the map/set

@Ono-Sendai
Copy link
Contributor Author

Awesome!
I apologise for not testing the changes more thoroughly. All I can say is that I have used the changed code and it seems to work fine :)

@Ono-Sendai
Copy link
Contributor Author

avoiding AABBTree node allocations could also be done with a custom allocator, might be easier.

@jrouwe
Copy link
Owner

jrouwe commented Nov 9, 2024

avoiding AABBTree node allocations could also be done with a custom allocator, might be easier.

Too late, it's integrated now.

@Ono-Sendai
Copy link
Contributor Author

avoiding AABBTree node allocations could also be done with a custom allocator, might be easier.

Too late, it's integrated now.

haha ok.
I think the approach with a single vector is good for a single-threaded builder. The approach I use for a multithreaded builder (an idea from Embree) is to allocate nodes in per-thread buffers of e.g. 256 nodes. Then these are merged in a final pass in the main thread.

@jrouwe
Copy link
Owner

jrouwe commented Nov 9, 2024

TriangleSplitterBinning has been merged as well. Note that your version has a division by zero when the bounding box has 0 size in one of the dimensions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants