Performance testing and optimization #114
Some questions to consider here: Which information can/should be gathered "always", and which information should be gathered only in dedicated benchmark/performance comparison runs*? Should these measurements be enabled purely programmatically, via compile-time flags, or some combination of both?

Some information could probably be gathered by some simple wrappers around the interfaces. As a pseudocode example, a simple counter for the number of tile requests could just be sneaked into the request path. Another obvious, simple option for some cases could be to add some coarse-grained wall-clock-time checks around some calls. This could, to some extent, be pushed even further down, with more fine-grained timings/counters that can be used to print, e.g., the average time that was spent for culling or MSE computations or whatnot. But of course, this can quickly become pretty complex, and at a certain level of detail it does not make sense any more, and a dedicated profiler run may be necessary.

* When talking about "performance comparison runs", one thing that I currently find hard to imagine is how these tests can be made reproducible. For example, trying to optimize a detail of the traversal only pays off if the "before" and "after" runs can be compared reliably.
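A minimal sketch of what such a request counter and a coarse scoped wall-clock timer might look like (hypothetical names and call sites, not actual cesium-native API):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Hypothetical instrumentation helpers; none of these names exist in cesium-native.

// A wrapper around the request entry point could simply bump a counter:
std::atomic<std::uint64_t> g_tileRequestCount{0};

// Coarse wall-clock timing around a call, printed when the scope ends.
class ScopedTimer {
public:
  explicit ScopedTimer(const char* label)
      : _label(label), _start(std::chrono::steady_clock::now()) {}
  ~ScopedTimer() {
    const auto end = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(end - _start).count();
    std::printf("%s took %.3f ms\n", _label, ms);
  }

private:
  const char* _label;
  std::chrono::steady_clock::time_point _start;
};

// Usage sketch (the called functions are placeholders, not real API):
//   ++g_tileRequestCount;              // inside the request wrapper
//   {
//     ScopedTimer timer("updateView"); // around the call being measured
//     tileset.updateView(viewState);
//   }
```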
So below is the initial performance measurement for our glTF loader (Draco and image decoding included) and terrain upsampling. I still need to figure out how to capture those datasets per frame and how to measure how long a tile waits for its content to be downloaded. To view the results, load the attached trace.json in chrome://tracing. @kring or @javagl, do you have any feedback on it?
@baothientran I'm not totally sure what I'm looking at here... but I loaded the trace.json in about:tracing. Is my understanding correct that a bunch of images took ~4 seconds to load? And that it happened multiple times across threads, but only approximately once per thread? That seems really surprising; I can't explain why that would happen.
@kring yup. The asset measured above is Melbourne. I'm surprised too that decoding images takes longer than decoding Draco. I put a trace inside the image decoding path to narrow it down further.
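For reference, the trace.json that chrome://tracing (or about:tracing) loads is a JSON array of events in Chrome's Trace Event Format. A minimal sketch of code emitting such "complete" events could look like this; it is only an illustration of the file format, not the actual asset-pipeline tracer:

```cpp
#include <cstdint>
#include <cstdio>
#include <mutex>

// Writes Chrome Trace Event Format "complete" events ("ph":"X", timestamps and
// durations in microseconds), which is what chrome://tracing loads.
class TraceFile {
public:
  explicit TraceFile(const char* path) : _file(std::fopen(path, "w")) {
    std::fprintf(_file, "[\n");
  }
  ~TraceFile() {
    std::fprintf(_file, "{}]\n"); // dummy last element keeps the JSON valid
    std::fclose(_file);
  }
  void completeEvent(
      const char* name,
      std::int64_t startMicroseconds,
      std::int64_t durationMicroseconds,
      int threadId) {
    std::lock_guard<std::mutex> lock(_mutex);
    std::fprintf(
        _file,
        "{\"name\":\"%s\",\"cat\":\"cesium\",\"ph\":\"X\",\"ts\":%lld,"
        "\"dur\":%lld,\"pid\":1,\"tid\":%d},\n",
        name,
        static_cast<long long>(startMicroseconds),
        static_cast<long long>(durationMicroseconds),
        threadId);
  }

private:
  std::FILE* _file;
  std::mutex _mutex;
};
```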
It's hard to directly compare that, but a backlink to #106 (comment) may be interesting here. At least, it points at related observations about loading times.

More generally: The first task that is addressed here is setting up the infrastructure for the performance tests. We did not talk about the levels of integration that I mentioned in my first comment, but apparently the focus right now is on integrating the tracing system on the level of compile-time flags. (I haven't looked into the details yet, but ... is that the profiler from the asset-pipeline, cf. https://app.slack.com/client/T4ATYJZD5/CU7DYAA65/thread/CU7DYAA65-1612329074.105700 ?)

However, to repeat from my first comment: For a detailed analysis and comparison of profiling results (regardless of how they are gathered technically), we will have to make sure that the runs are reproducible. Such a trace right now may show up certain specific issues (like the slow image decoding mentioned above). I think that it could make sense to use some of the test infrastructure that is in https://github.com/CesiumGS/cesium-native/tree/main/Cesium3DTiles/test right now to load a dedicated tileset from file (maybe even pre-loaded, to avoid the measurements being distorted by caching), and do some purely programmatic "camera flight" (similar to, but more elaborate than, what I did in https://github.com/CesiumGS/cesium-native/pull/127/files#diff-6ab4645380113fe3ee5d74c4fb3322e6bd9f4ce69dfff880d03bb492a52d6689R38 ) to reliably measure the performance impact by comparing "before the change" and "after the change" traces.

(Again, these points are rather "next steps". Having some profiling infrastructure set up is the precondition for all of that. One could argue that we already could use the sophisticated profiling+tracing functionality of Visual Studio once we have the capability to run such a test scenario programmatically.)
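A minimal sketch of what such a reproducible, purely programmatic "camera flight" measurement could look like; the Tileset/ViewState names mirror cesium-native, but the types and signatures here are stand-ins, not the real API:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

struct ViewState {};                                      // stand-in type
struct Tileset { void updateView(const ViewState&) {} };  // stand-in type

std::vector<ViewState> makeScriptedCameraFlight() {
  // Deterministic, hard-coded camera views so that "before" and "after" runs
  // see exactly the same sequence and can be compared directly.
  return std::vector<ViewState>(500);
}

void measureTraversal(Tileset& tileset) {
  const std::vector<ViewState> views = makeScriptedCameraFlight();
  const auto start = std::chrono::steady_clock::now();
  for (const ViewState& view : views) {
    tileset.updateView(view); // the call being compared before/after a change
  }
  const auto end = std::chrono::steady_clock::now();
  const double ms =
      std::chrono::duration<double, std::milli>(end - start).count();
  std::printf(
      "traversal over %zu views: %.2f ms (%.3f ms/view)\n",
      views.size(), ms, ms / views.size());
}
```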
@javagl yup, it is from our asset-pipeline's tracer currently. The problem with a microbenchmark like the one that is currently set up in the test suite is that it only exercises an isolated piece of code rather than a full, real-world run.
Also, the advantage of the tracer is that it captures timings from the complete run, across all threads, rather than from an isolated benchmark.
Two points from your previous comments that I'd like to emphasize:
Just stumbled across this one. Just wanted to say that I hit the stb_image slowness wall before, and the issue is even worse when running in wasm (#76 (comment)).
Quick update, mostly just as an excuse to share some nifty charts...
It's produced in chrome://tracing by rubber-band selecting the CWT loading processes and selecting the function being measured. So it takes under 2ms on average, and the worst case is ~46ms.

Now here's the Melbourne tileset in the same run (I had Actors for both datasets): here the average is over 100ms and the longest tiles take close to a second. This is pretty surprising given what the function does not include. It does include physics cooking, though, so that's my best guess about what's taking all the time.

Here's the complete trace that can be loaded in chrome://tracing:
@kring Is that number from when the app is in game mode? If it's in editor mode, another source of the slowdown may be swapping caused by garbage collection not running in the editor. In the past, after forcing gc.CollectEveryFrame 1, I saw a higher frame rate in the editor.
That's a release build running in the editor. It's basically just the initial load of the two tilesets, on a 32GB system, so there shouldn't be any swapping. But I'll check it out to be sure.
PhysX cooking averages 46ms per Melbourne tile, compared to 1.3ms per CWT tile. Are the Melbourne tiles just that much more complex?
On the MikkTSpace side, I think we can gain substantial performance just by... not doing it. Tangents are required when we're using a normal map, because normal maps are expressed in tangent space. I thought that they were required more generally, because there doesn't appear to be a way to exclude tangents entirely from a static mesh. But if we just set them to the zero vector (which is a lot faster than running MikkTSpace), I can't see any rendering differences at all. We should probably add an "Always Include Tangents" option, though, in case users are using a custom material that requires them.

A very nice side benefit of not generating tangents is that a model that has normals but doesn't have a normal map will no longer need to have its vertices duplicated to avoid sharing them between triangles. That means less memory usage and less GPU load. Models without normals will still need to have vertices duplicated, though, so that we can generate flat normals as required by the glTF spec.
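A minimal plain-C++ sketch of that idea (the vertex struct and option names are hypothetical, not actual Unreal Engine or plugin types):

```cpp
#include <vector>

struct Vec3 { float x = 0.0f, y = 0.0f, z = 0.0f; };

// Hypothetical build vertex: tangent slots must exist in the static mesh,
// but their values only matter when a normal map samples tangent space.
struct BuildVertex {
  Vec3 position;
  Vec3 normal;
  Vec3 tangent;
  Vec3 bitangent;
};

void fillTangents(
    std::vector<BuildVertex>& vertices,
    bool hasNormalMap,
    bool alwaysIncludeTangents) {
  if (hasNormalMap || alwaysIncludeTangents) {
    // Expensive path: run MikkTSpace only when tangents are actually needed.
    // computeTangentsWithMikkTSpace(vertices);  // placeholder
    return;
  }
  // Cheap path: write zero tangents instead of running MikkTSpace.
  for (BuildVertex& v : vertices) {
    v.tangent = Vec3{};
    v.bitangent = Vec3{};
  }
}
```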
Do you think, by any chance, that Unreal generates normals for us too? It would be interesting to see whether the lighting changes at all if we remove the normals. If Unreal does generate normals automatically, we may get more performance by using indexed triangles instead of flattening them out as we currently do. (Sorry, I didn't notice you already said it in the last paragraph.)
We can't remove normals; they're always in the static mesh. Same with tangents. But we can set the normals to some constant value instead of computing them properly. When we do that, the lighting falls apart, as you'd expect. AFAIK there's no way to have UE generate them for us. Even if there were, I'd be nervous that it would do it in the game thread and tank performance, like it does when physics meshes are missing (only in the editor, though... outside the editor there's just no collision at all).

It might be nice to have an option that says "when normals are missing, ignore the glTF spec's requirement of flat normals and just compute a normal per vertex from the weighted average of the faces that share that vertex." That should be much more performant, and I suspect it'd be fine (or maybe even better) for 90% of cases.
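A minimal sketch of that "weighted average of shared faces" idea, not something cesium-native currently does (the glTF spec requires flat normals when normals are missing):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x = 0.0f, y = 0.0f, z = 0.0f; };

static Vec3 subtract(const Vec3& a, const Vec3& b) {
  return {a.x - b.x, a.y - b.y, a.z - b.z};
}
static Vec3 add(const Vec3& a, const Vec3& b) {
  return {a.x + b.x, a.y + b.y, a.z + b.z};
}
static Vec3 cross(const Vec3& a, const Vec3& b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// One normal per vertex, accumulated from the unnormalized face normals of the
// triangles that share it. The cross product's length is proportional to the
// triangle's area, so larger faces automatically get more weight; a final
// normalization pass (or the vertex shader) finishes the job. No vertex
// duplication is needed, unlike the flat-normal case.
std::vector<Vec3> computeSmoothNormals(
    const std::vector<Vec3>& positions,
    const std::vector<std::uint32_t>& indices) {
  std::vector<Vec3> normals(positions.size());
  for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
    const std::uint32_t i0 = indices[i];
    const std::uint32_t i1 = indices[i + 1];
    const std::uint32_t i2 = indices[i + 2];
    const Vec3 faceNormal = cross(
        subtract(positions[i1], positions[i0]),
        subtract(positions[i2], positions[i0]));
    normals[i0] = add(normals[i0], faceNormal);
    normals[i1] = add(normals[i1], faceNormal);
    normals[i2] = add(normals[i2], faceNormal);
  }
  return normals;
}
```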
I wrote some new issues based on the work done here. Performance is a never-ending task, but I think we can close this one and write new, more specific performance-related issues as necessary.
This is a time-boxed (10 days) task to examine various aspects of cesium-native and Cesium for Unreal performance, to measure where they spend their time and memory, and to look for ways to improve them.
We should look for ways to instrument the tile load pipeline in order to collect statistics, such as the number of tile requests and how long each tile spends waiting for its content to be downloaded.
Unreal Engine has a nice system for collecting this kind of information without a major impact on performance, but we can't easily use that from cesium-native. What can we do instead?
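One possibility (a sketch under assumed names, not an existing cesium-native API) is a plain statistics struct that cesium-native fills with cheap atomic counters and that the Unreal plugin reads once per frame to feed into whatever display it prefers:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical statistics block; none of these names exist in cesium-native.
// The native library only bumps counters (lock-free, negligible overhead);
// the engine integration decides how to surface them (UE stats, logs,
// on-screen debug text, ...).
struct TileLoadStatistics {
  std::atomic<std::uint32_t> tilesRequested{0};
  std::atomic<std::uint32_t> tilesLoaded{0};
  std::atomic<std::uint32_t> tilesFailed{0};
  std::atomic<std::uint64_t> bytesDownloaded{0};
  std::atomic<std::uint64_t> microsecondsWaitingForContent{0};
};

// On the Unreal side (sketch): read the struct each frame via a hypothetical
// accessor on the tileset and forward the values to the engine's own
// reporting, so cesium-native itself never depends on Unreal's stat system.
```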