Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize CityOnPlanet CPU-side rendering cost #5943

Merged
merged 9 commits into from
Nov 8, 2024

Conversation

sturnclaw
Copy link
Member

As is usual for this codebase, you can't change one thing without touching a dozen other unrelated files.

This PR addresses some of the low hanging optimization fruit in CityOnPlanet. A short summary is that it moves as much of the matrix math out of the frustum culling inner loop as possible, turning it into a simple test-and-copy loop.

The total effect of these optimizations is that the frustum-cull-and-copy pass went from ~220us to ~100us on my machine, representing anything from a 20% to 40% speedup to the overall performance cost of CityOnPlanet::Render().

The PR also does a decent amount of cleanup and tidying in other parts of the codebase. Of particular note is that we no longer have two duplicate implementations of world-space to screen-space projection - this is all handled by Graphics::ProjectToScreen<T> now. As a consequence, Graphics::Frustum is now 192 bytes, down from ~480 bytes. I've also replaced Frustum::TranslatePoint()'s incremental approximation with an actual analytic solution to scaling a point to fit within the far plane of the frustum.

@fluffyfreak I invite you to test this with your Raspberry Pi branch - I suspect you'll be a bit less CPU-bound when rendering cities, and might even be able to bump up the detail level a bit.


There is another major performance optimization that could be done in the future - currently, we create a separate instance buffer and upload all of this matrix data for every single StaticGeometry node in each SceneGraph::Model that we're rendering instances of.

This is because LOD switches are "just another node" and they split the array of instance transforms into smaller subarrays that are dispatched to each child group. Because they're "just another node", absolutely no assumptions can be made about where they are in the node hierarchy or how many of them there are, and thus every single StaticGeometry node has to assume it's the only one seeing that specific list of instance transforms and uploads a separate copy of the list to the GPU to be drawn.

I shouldn't have to specify that this is a MASSIVE waste of memory bandwidth, especially on integrated GPUs and handheld devices (i.e. Steam Deck) where system memory speeds are... not the greatest to start with.

Unfortunately, a proper solution to this is non-trivial. I have a long-range plan to "flatten" the node hierarchy when rendering, allowing model shaders to pull in animated MatrixTransform data from an array stored uniform buffer rather than modifying the draw transform for every node, but that's somewhat orthogonal to the problem here.

I'd like to properly solidify the assumption set by Loader that LOD switches happen at the root of a model's hierarchy, moving the LOD-related code out of the LOD node and into the Model itself. That would allow the Model to generate and upload an instance buffer at top-level, and forward the instance buffer handle to the child geometry nodes. Some special handling would still be necessary for ModelNodes, but it'd be better than the current system.

Your thoughts and suggestions would be much appreciated, @fluffyfreak!

- Instead of setting the global transform to identity inside StaticGeometry, allow the caller to determine what the "base" transform should be
- Return viewport-normalized coordinates in 0..1 instead of pixel coordinates in 0..w, 0..h
- Add deprecation notice to ProjectToScreen overloads that read renderer internal state and return direct pixel coordinates
- The projection matrix is quite useful to know in a lot of places
- Will eventually replace asking the Renderer for its internal projection matrix state
- WorldView was only user, and the logic was duplicated in Graphics::ProjectToScreen
- If w-component of clip space is 0, all components of returned viewport coordinates will be +-inf - no need to introduce additional flow control
- Allows drastically simplifying Frustum class
- modelview / projection matrices are completely unnecessary now, only the planes are required for frustum testing.
- TranslatePoint was a complete kludge and its purpose is better served by an analytic solution.
- Allow constructing independent float/double versions of Graphics::Frustum where needed for speed or memory bandwidth concerns
- Precalculate instance transforms for each potentially-visible building at startup rather than every single frame.
- Move all matrix multiplication out of the inner loop and do frustum culling with an optimized single-precision float frustum structure.
- Pass a CameraContext rather than a Frustum to CityOnPlanet::Render for top-level frustum culling.
- Add comments indicating potential future performance improvements.
@sturnclaw sturnclaw added Performance C++ code Rendering Everything related to rendering labels Oct 31, 2024
@sturnclaw sturnclaw requested a review from fluffyfreak October 31, 2024 23:41
@fluffyfreak
Copy link
Contributor

Oh boy yes this is exciting, there's a lot here but I'll take a look at this asap 👍
A quick read through and the code looks excellent as always!

@fluffyfreak
Copy link
Contributor

Tested this in Windows but not had chance to on Pi yet, it works perfectly though. Not seen any issues with it so it'a already good to merge 👍

Copy link
Contributor

@fluffyfreak fluffyfreak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good code and I approve of all the changes. Lots of cleanup going on in here of stuff that's needed tidying up and improving for a long time. Great job!

@fluffyfreak
Copy link
Contributor

Testing on Pi shows no real difference at the moment, maybe 1 or 2 FPS but I didn't have time to dig deeper tonight 👍

@sturnclaw
Copy link
Member Author

That sounds about right - I expect the CPU-side cost wasn't the limiting factor here when running on Very Low city detail.

@fluffyfreak I'm going to merge this, but please feel free to provide feedback here on the proposed change to model LODs.

If you want, you can export a JSON frame profile (with ProfilerZoneOutput=1 in config) from the Pi and I'll take a gander and see if there are any particular low-hanging fruit related to CPU performance that we could address - the "worst" device I have access to right now is more than good enough to run Pioneer at 60FPS on low settings.

@sturnclaw sturnclaw merged commit 2ac32b7 into pioneerspacesim:master Nov 8, 2024
4 checks passed
@sturnclaw sturnclaw deleted the render-instanced branch November 8, 2024 19:41
@fluffyfreak
Copy link
Contributor

Unfortunately on RPi compiling with -DPROFILER_ENABLED=ON produces a SIGSEGV, Segmentation fault. Just before reaching the menu, or sometimes getting that far but then doing it when you try to go ingame 🤷

Info: LoadSurfaceFromFile: models/cockpits/default_cockpit/default_cockpit_norm.png: could not read file
Info: Created shader sphereimpostor (address=0x5555605ebc00)
Info: Listening for headtracking packets on 127.0.0.1:4242

Thread 1 "pioneer" received signal SIGSEGV, Segmentation fault.
Profiler::Caller::ForEachByRefProfiler::Caller::foreach::SoftReset (mapto=..., this=0x55555beec7b0) at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:686
686 mapto( mBuckets[ i ] );
(gdb) bt
#0 Profiler::Caller::ForEachByRefProfiler::Caller::foreach::SoftReset (mapto=..., this=0x55555beec7b0) at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:686
#1 Profiler::Caller::ForEachProfiler::Caller::foreach::SoftReset (mapto=..., this=0x55555beec7b0) at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:698
#2 Profiler::Caller::SoftReset (this=0x55555beec7b0) at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:808
#3 Profiler::resetThreads () at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:1512
#4 0x000055555591f074 in Profiler::reset () at /home/fluffyfreak/pioneer/contrib/profiler/Profiler.cpp:1658
#5 0x00005555555b9d50 in Application::Run (this=0x555555b4a110) at /home/fluffyfreak/pioneer/src/core/Application.cpp:234
#6 0x00005555555b0378 in main (argc=, argv=0x7fffffffef18) at /home/fluffyfreak/pioneer/src/Pi.h:144
(gdb)

The Profiler also seems to cause issues sometime on Visual Studio/Windows where it will soft lock on the PROFILE_SCOPED() within void RunHandler(double delta, const std::string &handler) and then other times it works just fine.

@fluffyfreak
Copy link
Contributor

fluffyfreak commented Nov 9, 2024

Speaking of Profiler issues, to build with Profiling enabled on Windows I had to change Frame.cpp like this:

src/Frame.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Frame.cpp b/src/Frame.cpp
index 7abeb275e..931bb64fc 100644
--- a/src/Frame.cpp
+++ b/src/Frame.cpp
@@ -305,7 +305,7 @@ void Frame::CollideFrames(void (*callback)(CollisionContact *))
if (!frame.m_collisionSpace)
continue;

- PROFILE_SCOPED_DESC(frame.m_label.c_str())
+ PROFILE_SCOPED_RAW(frame.m_label.c_str())
frame.m_collisionSpace->Collide(callback);
}
}

@sturnclaw
Copy link
Member Author

Hmm - is this with zone profiling enabled or not? I'll take a look at it, but I suspect some platform specific weirdness.

@fluffyfreak
Copy link
Contributor

With zone profiling enabled, without it enabled profiling does pretty much nothing on Windows.

@fluffyfreak
Copy link
Contributor

fluffyfreak commented Nov 11, 2024

on Pi, with zone profiling enabled, I can get to the main menu but going into a game causes the SIGSEGV segmentation fault listed above.

EDIT: Setting the number of worker threads to 1 just causes an immediate SIGSEGV before the main menu. It seems to be quite unpredictable which suggests some interaction with the threading.

@fluffyfreak
Copy link
Contributor

@Web-eWorks yeah looking through the instanced rendering within the scenegraph it's pretty horrible. I'm not sure that I understand you're proposal yet but I'm still thinking about it

@sturnclaw
Copy link
Member Author

Hah! I found the crash. If USE_CHRONO is defined, Profiler.cpp completely disables its threading support. I'll work on a PR to enable multithread-with-std::chrono and move the discussion there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ code Performance Rendering Everything related to rendering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants