Optimize CityOnPlanet CPU-side rendering cost #5943

sturnclaw · 2024-10-31T23:41:38Z

As is usual for this codebase, you can't change one thing without touching a dozen other unrelated files.

This PR addresses some of the low hanging optimization fruit in CityOnPlanet. A short summary is that it moves as much of the matrix math out of the frustum culling inner loop as possible, turning it into a simple test-and-copy loop.

The total effect of these optimizations is that the frustum-cull-and-copy pass went from ~220us to ~100us on my machine, representing anything from a 20% to 40% speedup to the overall performance cost of CityOnPlanet::Render().

The PR also does a decent amount of cleanup and tidying in other parts of the codebase. Of particular note is that we no longer have two duplicate implementations of world-space to screen-space projection - this is all handled by Graphics::ProjectToScreen<T> now. As a consequence, Graphics::Frustum is now 192 bytes, down from ~480 bytes. I've also replaced Frustum::TranslatePoint()'s incremental approximation with an actual analytic solution to scaling a point to fit within the far plane of the frustum.

@fluffyfreak I invite you to test this with your Raspberry Pi branch - I suspect you'll be a bit less CPU-bound when rendering cities, and might even be able to bump up the detail level a bit.

There is another major performance optimization that could be done in the future - currently, we create a separate instance buffer and upload all of this matrix data for every single StaticGeometry node in each SceneGraph::Model that we're rendering instances of.

This is because LOD switches are "just another node" and they split the array of instance transforms into smaller subarrays that are dispatched to each child group. Because they're "just another node", absolutely no assumptions can be made about where they are in the node hierarchy or how many of them there are, and thus every single StaticGeometry node has to assume it's the only one seeing that specific list of instance transforms and uploads a separate copy of the list to the GPU to be drawn.

I shouldn't have to specify that this is a MASSIVE waste of memory bandwidth, especially on integrated GPUs and handheld devices (i.e. Steam Deck) where system memory speeds are... not the greatest to start with.

Unfortunately, a proper solution to this is non-trivial. I have a long-range plan to "flatten" the node hierarchy when rendering, allowing model shaders to pull in animated MatrixTransform data from an array stored uniform buffer rather than modifying the draw transform for every node, but that's somewhat orthogonal to the problem here.

I'd like to properly solidify the assumption set by Loader that LOD switches happen at the root of a model's hierarchy, moving the LOD-related code out of the LOD node and into the Model itself. That would allow the Model to generate and upload an instance buffer at top-level, and forward the instance buffer handle to the child geometry nodes. Some special handling would still be necessary for ModelNodes, but it'd be better than the current system.

Your thoughts and suggestions would be much appreciated, @fluffyfreak!

- Instead of setting the global transform to identity inside StaticGeometry, allow the caller to determine what the "base" transform should be

- Return viewport-normalized coordinates in 0..1 instead of pixel coordinates in 0..w, 0..h - Add deprecation notice to ProjectToScreen overloads that read renderer internal state and return direct pixel coordinates

- The projection matrix is quite useful to know in a lot of places - Will eventually replace asking the Renderer for its internal projection matrix state

- WorldView was only user, and the logic was duplicated in Graphics::ProjectToScreen - If w-component of clip space is 0, all components of returned viewport coordinates will be +-inf - no need to introduce additional flow control - Allows drastically simplifying Frustum class

- modelview / projection matrices are completely unnecessary now, only the planes are required for frustum testing. - TranslatePoint was a complete kludge and its purpose is better served by an analytic solution.

- Allow constructing independent float/double versions of Graphics::Frustum where needed for speed or memory bandwidth concerns

- Precalculate instance transforms for each potentially-visible building at startup rather than every single frame. - Move all matrix multiplication out of the inner loop and do frustum culling with an optimized single-precision float frustum structure. - Pass a CameraContext rather than a Frustum to CityOnPlanet::Render for top-level frustum culling. - Add comments indicating potential future performance improvements.

fluffyfreak · 2024-11-05T20:24:47Z

Oh boy yes this is exciting, there's a lot here but I'll take a look at this asap 👍
A quick read through and the code looks excellent as always!

fluffyfreak · 2024-11-07T17:07:02Z

Tested this in Windows but not had chance to on Pi yet, it works perfectly though. Not seen any issues with it so it'a already good to merge 👍

fluffyfreak

This is good code and I approve of all the changes. Lots of cleanup going on in here of stuff that's needed tidying up and improving for a long time. Great job!

fluffyfreak · 2024-11-08T19:00:50Z

Testing on Pi shows no real difference at the moment, maybe 1 or 2 FPS but I didn't have time to dig deeper tonight 👍

sturnclaw · 2024-11-08T19:40:50Z

That sounds about right - I expect the CPU-side cost wasn't the limiting factor here when running on Very Low city detail.

@fluffyfreak I'm going to merge this, but please feel free to provide feedback here on the proposed change to model LODs.

If you want, you can export a JSON frame profile (with ProfilerZoneOutput=1 in config) from the Pi and I'll take a gander and see if there are any particular low-hanging fruit related to CPU performance that we could address - the "worst" device I have access to right now is more than good enough to run Pioneer at 60FPS on low settings.

fluffyfreak · 2024-11-09T14:49:25Z

Unfortunately on RPi compiling with -DPROFILER_ENABLED=ON produces a SIGSEGV, Segmentation fault. Just before reaching the menu, or sometimes getting that far but then doing it when you try to go ingame 🤷

Info: LoadSurfaceFromFile: models/cockpits/default_cockpit/default_cockpit_norm.png: could not read file
Info: Created shader sphereimpostor (address=0x5555605ebc00)
Info: Listening for headtracking packets on 127.0.0.1:4242

Thread 1 "pioneer" received signal SIGSEGV, Segmentation fault.
Profiler::Caller::ForEachByRefProfiler::Caller::foreach::SoftReset (mapto=..., this=0x55555beec7b0) at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:686
686 mapto( mBuckets[ i ] );
(gdb) bt
#0 Profiler::Caller::ForEachByRefProfiler::Caller::foreach::SoftReset (mapto=..., this=0x55555beec7b0) at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:686
#1 Profiler::Caller::ForEachProfiler::Caller::foreach::SoftReset (mapto=..., this=0x55555beec7b0) at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:698
#2 Profiler::Caller::SoftReset (this=0x55555beec7b0) at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:808
#3 Profiler::resetThreads () at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:1512
#4 0x000055555591f074 in Profiler::reset () at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:1658
#5 0x00005555555b9d50 in Application::Run (this=0x555555b4a110) at /home/fluffyfreak/pioneer/src/core/Application.cpp:234
#6 0x00005555555b0378 in main (argc=, argv=0x7fffffffef18) at /home/fluffyfreak/pioneer/src/Pi.h:144
(gdb)

The Profiler also seems to cause issues sometime on Visual Studio/Windows where it will soft lock on the PROFILE_SCOPED() within void RunHandler(double delta, const std::string &handler) and then other times it works just fine.

fluffyfreak · 2024-11-09T14:52:23Z

Speaking of Profiler issues, to build with Profiling enabled on Windows I had to change Frame.cpp like this:

src/Frame.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Frame.cpp b/src/Frame.cpp
index 7abeb275e..931bb64fc 100644
--- a/src/Frame.cpp
+++ b/src/Frame.cpp
@@ -305,7 +305,7 @@ void Frame::CollideFrames(void (*callback)(CollisionContact *))
if (!frame.m_collisionSpace)
continue;

- PROFILE_SCOPED_DESC(frame.m_label.c_str())
+ PROFILE_SCOPED_RAW(frame.m_label.c_str())
frame.m_collisionSpace->Collide(callback);
}
}

sturnclaw · 2024-11-11T19:11:49Z

Hmm - is this with zone profiling enabled or not? I'll take a look at it, but I suspect some platform specific weirdness.

fluffyfreak · 2024-11-11T21:16:59Z

With zone profiling enabled, without it enabled profiling does pretty much nothing on Windows.

fluffyfreak · 2024-11-11T21:21:10Z

on Pi, with zone profiling enabled, I can get to the main menu but going into a game causes the SIGSEGV segmentation fault listed above.

EDIT: Setting the number of worker threads to 1 just causes an immediate SIGSEGV before the main menu. It seems to be quite unpredictable which suggests some interaction with the threading.

fluffyfreak · 2024-11-14T17:26:11Z

@Web-eWorks yeah looking through the instanced rendering within the scenegraph it's pretty horrible. I'm not sure that I understand you're proposal yet but I'm still thinking about it

sturnclaw · 2024-11-14T18:29:49Z

Hah! I found the crash. If USE_CHRONO is defined, Profiler.cpp completely disables its threading support. I'll work on a PR to enable multithread-with-std::chrono and move the discussion there.

sturnclaw added 9 commits October 31, 2024 12:50

Rename instancing Model::Render -> RenderInstanced

3231d98

Move Plane/Sphere helper classes to math/ folder

310513a

SceneGraph: properly handle top-level transform when rendering instanced

64eaa0e

- Instead of setting the global transform to identity inside StaticGeometry, allow the caller to determine what the "base" transform should be

Remove ViewportExtents param from ProjectToScreen

33fa3da

- Return viewport-normalized coordinates in 0..1 instead of pixel coordinates in 0..w, 0..h - Add deprecation notice to ProjectToScreen overloads that read renderer internal state and return direct pixel coordinates

Camera: cache projection matrix

3dc8cc3

- The projection matrix is quite useful to know in a lot of places - Will eventually replace asking the Renderer for its internal projection matrix state

Frustum: remove model/projection matrix, TranslatePoint() method

fab84c9

- modelview / projection matrices are completely unnecessary now, only the planes are required for frustum testing. - TranslatePoint was a complete kludge and its purpose is better served by an analytic solution.

Frustum: template on floating point number type

c6faa67

- Allow constructing independent float/double versions of Graphics::Frustum where needed for speed or memory bandwidth concerns

sturnclaw added Performance C++ code Rendering Everything related to rendering labels Oct 31, 2024

sturnclaw requested a review from fluffyfreak October 31, 2024 23:41

fluffyfreak approved these changes Nov 7, 2024

View reviewed changes

sturnclaw merged commit 2ac32b7 into pioneerspacesim:master Nov 8, 2024
4 checks passed

sturnclaw deleted the render-instanced branch November 8, 2024 19:41

sturnclaw mentioned this pull request Nov 14, 2024

Linux de-profile and msvc compilation fix #5967

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize CityOnPlanet CPU-side rendering cost #5943

Optimize CityOnPlanet CPU-side rendering cost #5943

sturnclaw commented Oct 31, 2024

fluffyfreak commented Nov 5, 2024

fluffyfreak commented Nov 7, 2024

fluffyfreak left a comment

fluffyfreak commented Nov 8, 2024

sturnclaw commented Nov 8, 2024

fluffyfreak commented Nov 9, 2024

fluffyfreak commented Nov 9, 2024 •

edited

Loading

sturnclaw commented Nov 11, 2024

fluffyfreak commented Nov 11, 2024

fluffyfreak commented Nov 11, 2024 •

edited

Loading

fluffyfreak commented Nov 14, 2024

sturnclaw commented Nov 14, 2024

Optimize CityOnPlanet CPU-side rendering cost #5943

Optimize CityOnPlanet CPU-side rendering cost #5943

Conversation

sturnclaw commented Oct 31, 2024

fluffyfreak commented Nov 5, 2024

fluffyfreak commented Nov 7, 2024

fluffyfreak left a comment

Choose a reason for hiding this comment

fluffyfreak commented Nov 8, 2024

sturnclaw commented Nov 8, 2024

fluffyfreak commented Nov 9, 2024

fluffyfreak commented Nov 9, 2024 • edited Loading

sturnclaw commented Nov 11, 2024

fluffyfreak commented Nov 11, 2024

fluffyfreak commented Nov 11, 2024 • edited Loading

fluffyfreak commented Nov 14, 2024

sturnclaw commented Nov 14, 2024

fluffyfreak commented Nov 9, 2024 •

edited

Loading

fluffyfreak commented Nov 11, 2024 •

edited

Loading