-
-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize CityOnPlanet CPU-side rendering cost #5943
Conversation
- Instead of setting the global transform to identity inside StaticGeometry, allow the caller to determine what the "base" transform should be
- Return viewport-normalized coordinates in 0..1 instead of pixel coordinates in 0..w, 0..h - Add deprecation notice to ProjectToScreen overloads that read renderer internal state and return direct pixel coordinates
- The projection matrix is quite useful to know in a lot of places - Will eventually replace asking the Renderer for its internal projection matrix state
- WorldView was only user, and the logic was duplicated in Graphics::ProjectToScreen - If w-component of clip space is 0, all components of returned viewport coordinates will be +-inf - no need to introduce additional flow control - Allows drastically simplifying Frustum class
- modelview / projection matrices are completely unnecessary now, only the planes are required for frustum testing. - TranslatePoint was a complete kludge and its purpose is better served by an analytic solution.
- Allow constructing independent float/double versions of Graphics::Frustum where needed for speed or memory bandwidth concerns
- Precalculate instance transforms for each potentially-visible building at startup rather than every single frame. - Move all matrix multiplication out of the inner loop and do frustum culling with an optimized single-precision float frustum structure. - Pass a CameraContext rather than a Frustum to CityOnPlanet::Render for top-level frustum culling. - Add comments indicating potential future performance improvements.
Oh boy yes this is exciting, there's a lot here but I'll take a look at this asap 👍 |
Tested this in Windows but not had chance to on Pi yet, it works perfectly though. Not seen any issues with it so it'a already good to merge 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good code and I approve of all the changes. Lots of cleanup going on in here of stuff that's needed tidying up and improving for a long time. Great job!
Testing on Pi shows no real difference at the moment, maybe 1 or 2 FPS but I didn't have time to dig deeper tonight 👍 |
That sounds about right - I expect the CPU-side cost wasn't the limiting factor here when running on Very Low city detail. @fluffyfreak I'm going to merge this, but please feel free to provide feedback here on the proposed change to model LODs. If you want, you can export a JSON frame profile (with |
Unfortunately on RPi compiling with -DPROFILER_ENABLED=ON produces a SIGSEGV, Segmentation fault. Just before reaching the menu, or sometimes getting that far but then doing it when you try to go ingame 🤷
The Profiler also seems to cause issues sometime on Visual Studio/Windows where it will soft lock on the |
Speaking of Profiler issues, to build with Profiling enabled on Windows I had to change
|
Hmm - is this with zone profiling enabled or not? I'll take a look at it, but I suspect some platform specific weirdness. |
With zone profiling enabled, without it enabled profiling does pretty much nothing on Windows. |
on Pi, with zone profiling enabled, I can get to the main menu but going into a game causes the SIGSEGV segmentation fault listed above. EDIT: Setting the number of worker threads to 1 just causes an immediate SIGSEGV before the main menu. It seems to be quite unpredictable which suggests some interaction with the threading. |
@Web-eWorks yeah looking through the instanced rendering within the scenegraph it's pretty horrible. I'm not sure that I understand you're proposal yet but I'm still thinking about it |
Hah! I found the crash. If |
As is usual for this codebase, you can't change one thing without touching a dozen other unrelated files.
This PR addresses some of the low hanging optimization fruit in CityOnPlanet. A short summary is that it moves as much of the matrix math out of the frustum culling inner loop as possible, turning it into a simple test-and-copy loop.
The total effect of these optimizations is that the frustum-cull-and-copy pass went from ~220us to ~100us on my machine, representing anything from a 20% to 40% speedup to the overall performance cost of
CityOnPlanet::Render()
.The PR also does a decent amount of cleanup and tidying in other parts of the codebase. Of particular note is that we no longer have two duplicate implementations of world-space to screen-space projection - this is all handled by
Graphics::ProjectToScreen<T>
now. As a consequence,Graphics::Frustum
is now 192 bytes, down from ~480 bytes. I've also replacedFrustum::TranslatePoint()
's incremental approximation with an actual analytic solution to scaling a point to fit within the far plane of the frustum.@fluffyfreak I invite you to test this with your Raspberry Pi branch - I suspect you'll be a bit less CPU-bound when rendering cities, and might even be able to bump up the detail level a bit.
There is another major performance optimization that could be done in the future - currently, we create a separate instance buffer and upload all of this matrix data for every single StaticGeometry node in each
SceneGraph::Model
that we're rendering instances of.This is because LOD switches are "just another node" and they split the array of instance transforms into smaller subarrays that are dispatched to each child group. Because they're "just another node", absolutely no assumptions can be made about where they are in the node hierarchy or how many of them there are, and thus every single StaticGeometry node has to assume it's the only one seeing that specific list of instance transforms and uploads a separate copy of the list to the GPU to be drawn.
I shouldn't have to specify that this is a MASSIVE waste of memory bandwidth, especially on integrated GPUs and handheld devices (i.e. Steam Deck) where system memory speeds are... not the greatest to start with.
Unfortunately, a proper solution to this is non-trivial. I have a long-range plan to "flatten" the node hierarchy when rendering, allowing model shaders to pull in animated MatrixTransform data from an array stored uniform buffer rather than modifying the draw transform for every node, but that's somewhat orthogonal to the problem here.
I'd like to properly solidify the assumption set by Loader that LOD switches happen at the root of a model's hierarchy, moving the LOD-related code out of the LOD node and into the Model itself. That would allow the Model to generate and upload an instance buffer at top-level, and forward the instance buffer handle to the child geometry nodes. Some special handling would still be necessary for ModelNodes, but it'd be better than the current system.
Your thoughts and suggestions would be much appreciated, @fluffyfreak!