Some optimisations for reducing number of memory allocations and improving BVH build speed. #1319

Ono-Sendai · 2024-11-04T13:59:39Z

Hi Jorrit,
MeshShape building is a bit of a bottleneck in Substrata: the 3D models are user-generated content, so I can't really build them ahead of time.
As such I wanted to reduce the number of memory allocations Jolt makes, and also speed up BVH building a bit. The large number of allocations was hurting multi-threaded computation quite a lot due to global allocator contention.

Results building a MeshShape with 152350 triangles:

Results before these optimisations are applied
---------------------------------------------------
createJoltShapeForBatchedMesh took 0.1375 s, num_allocs: 463825, min time so far: 0.1293 s

Results after optimisations are applied:
----------------------------------------------
createJoltShapeForBatchedMesh took 0.0872 s, num_allocs: 64, min time so far: 85.4597 ms

As you can see the number of allocations is decreased rather a lot :)

The code is not thoroughly tested; I mostly banged it out this evening, apart from using my existing HashMap and HashSet code. I'm not wedded to the HashMap and HashSet implementation, we just need something better than the standard library trash.

…hich uses linear probing. Unlike std::unordered_map, it doesn't make an allocation for each insertion; however, iterators are invalidated upon item insertion. It also requires passing an 'empty key' argument to the constructor. It is generally faster than std::unordered_map due to the use of open addressing and not doing so many allocations.
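To make the trade-offs in that description concrete, here is a minimal sketch of an open-addressing map with linear probing. It is not the HashMap from this PR: the class name `LinearProbeMap`, its methods, and the 0.5 load factor are all assumptions chosen for illustration. It shows why an 'empty key' is needed (to mark unused slots) and why insertion can invalidate pointers and iterators (the whole table is reallocated when it grows).

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical illustration only; not the HashMap from this PR.
// All slots live in one contiguous vector, so an insertion allocates
// only when the table grows, never once per element.
template <typename Key, typename Value>
class LinearProbeMap {
    using Slot = std::pair<Key, Value>;

public:
    // empty_key marks unused slots and must never be inserted as a real key.
    explicit LinearProbeMap(Key empty_key, std::size_t initial_capacity = 16)
        : empty_key_(empty_key),
          slots_(roundUpPow2(initial_capacity), Slot(empty_key, Value{})) {}

    // Inserts or overwrites. May grow the table, which invalidates
    // any pointers previously returned by find().
    void insert(const Key& key, const Value& value) {
        if ((size_ + 1) * 2 > slots_.size()) grow();  // keep load factor <= 0.5
        std::size_t i = findSlot(slots_, key);
        if (slots_[i].first == empty_key_) {
            slots_[i].first = key;
            ++size_;
        }
        slots_[i].second = value;
    }

    // Returns nullptr when the key is absent.
    const Value* find(const Key& key) const {
        std::size_t i = findSlot(slots_, key);
        return slots_[i].first == empty_key_ ? nullptr : &slots_[i].second;
    }

    std::size_t size() const { return size_; }

private:
    static std::size_t roundUpPow2(std::size_t n) {
        std::size_t p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Walks forward from the hashed position until it reaches the key or an
    // empty slot. The load factor cap guarantees an empty slot always exists.
    std::size_t findSlot(const std::vector<Slot>& slots, const Key& key) const {
        const std::size_t mask = slots.size() - 1;  // capacity is a power of two
        std::size_t i = std::hash<Key>{}(key) & mask;
        while (!(slots[i].first == empty_key_) && !(slots[i].first == key))
            i = (i + 1) & mask;
        return i;
    }

    // Doubles the capacity and re-inserts every occupied slot.
    void grow() {
        std::vector<Slot> bigger(slots_.size() * 2, Slot(empty_key_, Value{}));
        for (const Slot& s : slots_)
            if (!(s.first == empty_key_)) bigger[findSlot(bigger, s.first)] = s;
        slots_.swap(bigger);
    }

    Key empty_key_;
    std::vector<Slot> slots_;
    std::size_t size_ = 0;
};
```

For example, `LinearProbeMap<int, int> m(-1);` reserves -1 as the empty key, after which `m.insert(7, 49)` and `m.find(7)` behave like a regular map as long as -1 is never used as a real key.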

…to bins on all 3 dimensions, as each triangle is processed. Not thoroughly tested for correctness.

Results, for a mesh with 152350 tris:

Before optimisations
---------------------
createJoltShapeForBatchedMesh took 0.1297 s, num_allocs: 64, min time so far: 109.7628 ms

With optimisations
-------------------
createJoltShapeForBatchedMesh took 0.0872 s, num_allocs: 64, min time so far: 85.4597 ms

CLAassistant · 2024-11-04T13:59:45Z

All committers have signed the CLA.

jrouwe · 2024-11-04T22:08:19Z

Thanks! I'm going to need some time to process/test this.

Ono-Sendai · 2024-11-05T01:14:30Z

Here's a before-optimisations vs. after-optimisations profile trace (from the Tracy profiler):

The sloped lines in the memory usage graph in the 'before' trace are Jolt (in particular std::unordered_map / std::unordered_set) doing lots of little allocations :)

jrouwe · 2024-11-09T13:23:43Z

FYI: I'm going to integrate this in stages. First the changes to AABBTree, then I'll look at the bins and finally at the map/set

Ono-Sendai · 2024-11-09T14:35:36Z

Awesome!
I apologise for not testing the changes more thoroughly. All I can say is that I have used the changed code and it seems to work fine :)

Ono-Sendai · 2024-11-09T14:36:36Z

avoiding AABBTree node allocations could also be done with a custom allocator, might be easier.

jrouwe · 2024-11-09T14:41:33Z

> avoiding AABBTree node allocations could also be done with a custom allocator, might be easier.

Too late, it's integrated now.

Ono-Sendai · 2024-11-09T14:54:10Z

> > avoiding AABBTree node allocations could also be done with a custom allocator, might be easier.
>
> Too late, it's integrated now.

haha ok.
I think the approach with a single vector is good for a single-threaded builder. The approach I use for a multithreaded builder (an idea from Embree) is to allocate nodes in per-thread buffers of e.g. 256 nodes. Then these are merged in a final pass in the main thread.
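The per-thread buffer idea above can be sketched roughly as follows. This is a simplified illustration, not code from Jolt, Embree, or this PR: all names (`Node`, `Tree`, `buildLocal`, `buildParallel`) are hypothetical, each worker owns one growable buffer rather than fixed 256-node chunks, and the "tree" is a plain balanced binary tree over item indices rather than a real BVH. The point it shows is that workers never contend on a shared allocator, and the main thread fixes up child indices when concatenating the buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical node type: children are indices, -1 marks a leaf.
struct Node {
    std::int32_t left = -1;
    std::int32_t right = -1;
    std::int32_t item = -1;  // payload, valid for leaves only
};

struct Tree {
    std::vector<Node> nodes;
    std::int32_t root = -1;
};

// Builds a balanced subtree over items [begin, end) into a thread-local
// buffer and returns the local index of the subtree root.
std::int32_t buildLocal(std::vector<Node>& buf, std::int32_t begin, std::int32_t end) {
    const std::int32_t idx = static_cast<std::int32_t>(buf.size());
    buf.push_back(Node{});
    if (end - begin == 1) {
        buf[idx].item = begin;
        return idx;
    }
    const std::int32_t mid = begin + (end - begin) / 2;
    const std::int32_t l = buildLocal(buf, begin, mid);
    const std::int32_t r = buildLocal(buf, mid, end);
    buf[idx].left = l;  // index again: buf may have reallocated during recursion
    buf[idx].right = r;
    return idx;
}

// Assumes num_threads >= 1.
Tree buildParallel(std::int32_t num_items, std::int32_t num_threads) {
    std::vector<std::vector<Node>> local(num_threads);
    std::vector<std::int32_t> local_root(num_threads, -1);
    std::vector<std::thread> workers;
    for (std::int32_t t = 0; t < num_threads; ++t) {
        const auto begin = static_cast<std::int32_t>(std::int64_t(num_items) * t / num_threads);
        const auto end = static_cast<std::int32_t>(std::int64_t(num_items) * (t + 1) / num_threads);
        workers.emplace_back([&local, &local_root, t, begin, end] {
            if (begin < end) local_root[t] = buildLocal(local[t], begin, end);
        });
    }
    for (std::thread& w : workers) w.join();

    // Final pass on the main thread: concatenate the buffers and shift
    // each buffer's local child indices by its base offset.
    Tree tree;
    std::vector<std::int32_t> roots;
    for (std::int32_t t = 0; t < num_threads; ++t) {
        const std::int32_t base = static_cast<std::int32_t>(tree.nodes.size());
        for (Node n : local[t]) {
            if (n.left >= 0) { n.left += base; n.right += base; }
            tree.nodes.push_back(n);
        }
        if (local_root[t] >= 0) roots.push_back(base + local_root[t]);
    }
    // Join the subtree roots pairwise until a single root remains.
    while (roots.size() > 1) {
        std::vector<std::int32_t> next;
        for (std::size_t i = 0; i + 1 < roots.size(); i += 2) {
            Node parent;
            parent.left = roots[i];
            parent.right = roots[i + 1];
            next.push_back(static_cast<std::int32_t>(tree.nodes.size()));
            tree.nodes.push_back(parent);
        }
        if (roots.size() % 2 == 1) next.push_back(roots.back());
        roots.swap(next);
    }
    if (!roots.empty()) tree.root = roots[0];
    return tree;
}

// Helper for checking the result: counts leaves reachable from idx.
std::int32_t countLeaves(const Tree& tree, std::int32_t idx) {
    if (idx < 0) return 0;
    const Node& n = tree.nodes[idx];
    if (n.left < 0) return 1;
    return countLeaves(tree, n.left) + countLeaves(tree, n.right);
}
```

Storing children as indices rather than pointers is what makes the merge cheap: relocating a buffer only requires adding a base offset to each child index.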

jrouwe · 2024-11-09T19:51:26Z

TriangleSplitterBinning has been merged as well. Note that your version has a division by zero when the bounding box has 0 size in one of the dimensions.
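As an illustration of the failure mode mentioned above, the sketch below bins centroids on all three axes in a single pass and guards the case where the bounding box has zero extent along an axis. It is an assumption about one reasonable guard, not the fix that was actually merged into TriangleSplitterBinning; `Vec3`, `kNumBins`, `binIndex`, and `binCentroids` are hypothetical names, and centroids are assumed to be finite.

```cpp
#include <algorithm>
#include <array>
#include <vector>

constexpr int kNumBins = 8;  // hypothetical bin count

// Minimal stand-in vector type for this sketch.
struct Vec3 {
    float x, y, z;
    float operator[](int axis) const { return axis == 0 ? x : (axis == 1 ? y : z); }
};

// counts[axis][bin] = number of centroids that fall in that bin on that axis.
using BinCounts = std::array<std::array<int, kNumBins>, 3>;

// Maps v in [lo, hi] to a bin index in [0, kNumBins - 1].
inline int binIndex(float v, float lo, float hi) {
    const float extent = hi - lo;
    // A flat box on this axis would make the division below 0 / 0,
    // so every centroid goes into bin 0 instead.
    if (!(extent > 0.0f)) return 0;
    const float t = std::min(std::max((v - lo) / extent, 0.0f), 1.0f);
    return std::min(static_cast<int>(t * kNumBins), kNumBins - 1);
}

// Single pass over the centroids, updating the bins of all three axes
// as each one is processed.
BinCounts binCentroids(const std::vector<Vec3>& centroids, const Vec3& lo, const Vec3& hi) {
    BinCounts counts{};
    for (const Vec3& c : centroids)
        for (int axis = 0; axis < 3; ++axis)
            ++counts[axis][binIndex(c[axis], lo[axis], hi[axis])];
    return counts;
}
```

With this guard, a mesh whose centroids all share the same coordinate on one axis ends up with a single populated bin on that axis, so a splitter built on top of it would find no useful split there and can fall back to another axis.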

Ono-Sendai added 4 commits November 5, 2024 01:08

Avoid lots of individual memory allocations in AABBTreeBuilder. Instead allocate nodes and leaf triangles from single Arrays.

af1c7a9

using Array instead of Deque in AABBTreeToBuffer. Avoids a lot of allocations.

ca212da

Fix not using JPH::AlignedAllocate.

2c9c47a
