
Performance testing and optimization #114

Closed
kring opened this issue Jan 27, 2021 · 17 comments

@kring
Member

kring commented Jan 27, 2021

This is a time-boxed (10 days) task to examine various aspects of cesium-native and Cesium for Unreal performance, measure where the time and memory go, and look for ways to improve it. This includes:

  • Memory usage
  • CPU usage
  • GPU usage
  • Number of tile requests

We should look for ways to instrument the tile load pipeline in order to collect statistics, such as:

  • how long it takes to download a tile
  • how long it takes to parse a glTF
  • how long it takes to decode Draco buffers and images
  • how long it takes to upsample tiles for raster overlays
  • total time from "tile first identified as needed" to "tile ready to render"

Unreal Engine has a nice system for collecting this kind of information without a major impact on performance, but we can't easily use that from cesium-native. What can we do instead?
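One lightweight option, sketched below, is to write out events in the Chrome Trace Event JSON format, which later comments in this thread end up viewing via about:tracing / chrome://tracing. The Tracer and ScopedTrace names here are purely illustrative, not an existing cesium-native or Unreal API:

#include <chrono>
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>

// Appends "complete" ("ph":"X") events to a JSON array that chrome://tracing can load.
class Tracer {
public:
    static Tracer& instance() {
        static Tracer tracer("cesium-trace.json");
        return tracer;
    }

    void addEvent(const std::string& name, long long startUs, long long durationUs) {
        std::lock_guard<std::mutex> lock(_mutex);
        std::fprintf(
            _file,
            "%s{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%lld,\"dur\":%lld,\"pid\":1,\"tid\":%zu}",
            _first ? "" : ",\n",
            name.c_str(),
            startUs,
            durationUs,
            std::hash<std::thread::id>{}(std::this_thread::get_id()));
        _first = false;
    }

private:
    explicit Tracer(const char* path) : _file(std::fopen(path, "w")) { std::fprintf(_file, "[\n"); }
    ~Tracer() { std::fprintf(_file, "\n]\n"); std::fclose(_file); }

    std::FILE* _file;
    std::mutex _mutex;
    bool _first = true;
};

// RAII helper: measures the enclosing scope and records it on destruction.
class ScopedTrace {
public:
    explicit ScopedTrace(std::string name)
        : _name(std::move(name)), _start(std::chrono::steady_clock::now()) {}
    ~ScopedTrace() {
        using namespace std::chrono;
        auto now = steady_clock::now();
        Tracer::instance().addEvent(
            _name,
            duration_cast<microseconds>(_start.time_since_epoch()).count(),
            duration_cast<microseconds>(now - _start).count());
    }

private:
    std::string _name;
    std::chrono::steady_clock::time_point _start;
};

A glTF-parse or Draco-decode step could then be timed by dropping a single ScopedTrace trace("decodeDraco"); at the top of the function. The trace.json files shared later in this thread are in this same format.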

@kring kring added this to the Initial Release milestone Jan 27, 2021
@javagl
Contributor

javagl commented Jan 28, 2021

Some questions to consider here: Which information can/should be gathered "always", and which information should be gathered only in dedicated benchmark/performance comparison runs*? Should these measurements be enabled purely programmatically (if (args.contains("doTiming")) ...), via compile-time flags (#ifdef DO_TIMING), or should the information be gathered completely non-invasively - i.e., with external tools? How is the information supposed to be processed further? Are log messages sufficient, or should it be written to a file in a machine-processable, accumulated form?

Some information could probably be gathered by simple wrappers around the interfaces. As a pseudocode example, a CountingAssetAccessor:

class CountingAssetAccessor : public IAssetAccessor {
    std::shared_ptr<IAssetAccessor> delegate;
    std::atomic<int> counter{0};

public:
    explicit CountingAssetAccessor(std::shared_ptr<IAssetAccessor> delegate_)
        : delegate(std::move(delegate_)) {}

    // Forward each request to the real accessor, counting as we go.
    Request requestAsset(const std::string& url, const std::vector<Header>& headers) override {
        int count = ++counter;
        if (count % 100 == 0) {
            log("Did " + std::to_string(count) + " requests until now");
        }
        return delegate->requestAsset(url, headers);
    }
};

to count the number of tile requests, which could just be sneaked into the TilesetExternals, optionally and programmatically, without affecting any other part of the code.
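For illustration, wiring such a wrapper in might look roughly like this (a sketch only; createExternals is a placeholder for however the TilesetExternals are built today, and the pAssetAccessor field name is an assumption about the current struct layout):

// Hypothetical wiring: wrap the real accessor before handing the externals to the Tileset.
TilesetExternals externals = createExternals();
externals.pAssetAccessor =
    std::make_shared<CountingAssetAccessor>(externals.pAssetAccessor);
Tileset tileset(externals, ionAssetID, ionAccessToken);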

Another obvious, simple option for some cases could be to add some coarse-grained wall-clock-time checks like

auto before = std::chrono::steady_clock::now();
Image image = decode(encodedImage);
LOG_FINEST_LEVEL("That one took " + toString(std::chrono::steady_clock::now() - before));

around some calls. This could, to some extent, be pushed even further down, with more fine-grained timings/counters...

struct TraversalTimingDetails {
    // ...
    int64_t accumulatedTimeForX;
    int64_t accumulatedTimeForY;
    int64_t numberOfX;
    int64_t numberOfY;
};

... that can be used to print, e.g., the average time spent on culling or MSE computations or whatnot. But of course, this can quickly become pretty complex, and beyond a certain level of detail it no longer makes sense and a dedicated profiler run may be necessary.
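To make the aggregation idea concrete, here is a small illustrative sketch, using concrete field names in place of the X/Y placeholders above (names are invented for the example; nothing here is an existing cesium-native API):

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>

// Accumulates how much time was spent in, and how often we ran, the culling check.
struct CullingTimingDetails {
    int64_t accumulatedCullingTimeUs = 0;
    int64_t numberOfCullingChecks = 0;
};

// Wrap any traversal step the same way; only the culling check is shown here.
template <typename CullingCheck>
bool timedCullingCheck(CullingTimingDetails& details, CullingCheck&& check) {
    auto start = std::chrono::steady_clock::now();
    bool culled = check();
    details.accumulatedCullingTimeUs +=
        std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start).count();
    ++details.numberOfCullingChecks;
    return culled;
}

void printAverages(const CullingTimingDetails& details) {
    std::cout << "average culling time (us): "
              << double(details.accumulatedCullingTimeUs) /
                     double(std::max<int64_t>(int64_t(1), details.numberOfCullingChecks))
              << "\n";
}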


* When talking about "performance comparison runs", one thing that I currently find hard to imagine is how these tests can be made reproducible. For example, trying to optimize a detail of the traversal in cesium-native by trying to create a camera flight in Unreal (that is, roughly speaking, always supposed to request the same tiles) may be challenging. I played with some possible "pure cesium-native" tests, but only as a side task.

@baothientran
Contributor

So below is the initial performance measurement for our glTF loader (Draco and image decoding too) and terrain upsampling. I still need to figure out how to capture that data per frame and how to measure how long a tile waits for its content to be downloaded. To view the trace.json file attached at the bottom, open it in Chrome by going to about:tracing, then click the Load button at the top left of the screen. It will then display the performance data across threads. There are four vertical buttons floating over the content panel that let you zoom in and see the measurements more clearly. All of this is currently in the profiler branch. It should display like this:
[screenshot: trace.json loaded in about:tracing]

@kring or @javagl do you have any feedback on it?
trace.json.txt

@kring
Member Author

kring commented Mar 25, 2021

@baothientran I'm not totally sure what I'm looking at here... but I loaded the trace.json in about:tracing. Is my understanding correct that a bunch of images took ~4 seconds to load? And that happened multiple times across threads, but only approximately once per thread? That seems really surprising, I can't explain why that would happen.

@baothientran
Contributor

@kring yup. The asset measured above is Melbourne. I'm surprised too that decoding images takes longer than decoding Draco. I put a trace inside stb_image, and it looks like it spends most of its time in there. I also measured our code before and after calling stb_image's functions, and those parts are very quick. So I'm not sure why stb_image decodes JPEGs so slowly. Let me confirm real quick with the max and min and see what happens.
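For reference, the before/after measurement described here can be as small as the sketch below around the actual stb_image call (decodeAndTime and jpegBytes are illustrative placeholders; STB_IMAGE_IMPLEMENTATION is assumed to be defined in some other translation unit as usual):

#include <chrono>
#include <cstdio>
#include <vector>

#include "stb_image.h"

// Times a single stbi_load_from_memory call on an already-downloaded JPEG buffer.
void decodeAndTime(const std::vector<unsigned char>& jpegBytes) {
    int width = 0, height = 0, channels = 0;
    auto start = std::chrono::steady_clock::now();
    unsigned char* pixels = stbi_load_from_memory(
        jpegBytes.data(),
        static_cast<int>(jpegBytes.size()),
        &width,
        &height,
        &channels,
        4 /* force RGBA */);
    auto elapsedMs = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("decoded %dx%d image in %lld ms\n", width, height,
                static_cast<long long>(elapsedMs));
    stbi_image_free(pixels);
}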

@javagl
Contributor

javagl commented Mar 26, 2021

It's hard to directly compare that, but a backlink to #106 (comment) may be interesting here. At least, it shows that stb_image can indeed be a bottleneck. There, it was for PNG files, IIRC, but maybe it has performance caveats in general. A more detailed analysis may be necessary here. For example: Which image (exactly) is taking so long, compared to one that did not take long? Roughly: Was that specific image particularly large, or did it have some odd compression settings, or ... ...

More generally: The first task that is addressed here is setting up the infrastructure for the performance tests. We did not talk about the levels of integration that I mentioned in my first comment. But apparently, the focus right now is on integrating the tracing system on the level of compile-time flags. (I haven't looked into the details yet, but ... is that the profiler from the asset-pipeline, c.f. https://app.slack.com/client/T4ATYJZD5/CU7DYAA65/thread/CU7DYAA65-1612329074.105700 ?)

However, to repeat from my first comment: For a detailed analysis and comparison of profiling results (regardless of how they are gathered technically), we will have to make sure that the runs are reproducible. Such a trace right now may show up certain specific issues (like stb_image being slow ~"sometimes"), but it will be hard to "zoom" into the problematic function systematically.

I think that it could make sense to use some of the test infrastructure that is in https://github.com/CesiumGS/cesium-native/tree/main/Cesium3DTiles/test right now to load a dedicated tileset from file (maybe even pre-loaded, to avoid the measurements being distorted by caching), and do some purely programmatic "camera flight" (similar to (but more elaborate than) what I did in https://github.com/CesiumGS/cesium-native/pull/127/files#diff-6ab4645380113fe3ee5d74c4fb3322e6bd9f4ce69dfff880d03bb492a52d6689R38 ) to reliably measure the performance impact by comparing "before the change" and "after the change" traces.

(Again, these points are rather "next steps". Having some profiling infrastructure set up is the precondition for all of that. One could argue that we already could use the sophisticated profiling+tracing functionality of Visual Studio once we have the capability to run some cesium-native-performance-test.exe that is independent of cesium-unreal, but on the other hand, these profiling capabilities are somewhat independent of each other)

@baothientran
Contributor

@javagl yup, it is from our asset-pipeline's tracer currently. The problem with a microbenchmark like the one currently set up in Cesium3DTiles/test is that it may not be accurate: on the Unreal side we use Unreal's concurrency and HTTP systems, so whatever Unreal does behind the scenes with those systems can affect the download time and the overall performance of loading tiles. Though profiling like that is hard to reproduce, one can mitigate it by loading a tileset and moving around the area to gather some data, then computing statistics like the average or median per frame. Do that for several runs and see what happens. (Well, I don't know what I'm talking about here, just some random ideas in my head.)

@baothientran
Contributor

Also, the advantage of the asset-pipeline tracer is that it is only compiled in when we build with the TRACING_ENABLED macro. Otherwise, no code is emitted for the profiling parts. If we want to detect which tile is the culprit causing a performance problem, I imagine we can use a more temporary, intrusive method to look into the details.
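For anyone unfamiliar with that pattern, the compile-time gating works roughly like the sketch below (macro and function names are illustrative, not the actual asset-pipeline tracer API; ScopedTrace stands for any RAII timer, like the one sketched near the top of this thread):

// When TRACING_ENABLED is not defined, the macro expands to nothing, so the
// release build contains no tracing code at all.
#ifdef TRACING_ENABLED
#define TRACE_SCOPE(name) ScopedTrace tracedScope_(name)
#else
#define TRACE_SCOPE(name)
#endif

void decodeDracoBuffer(/* ... */) {
    TRACE_SCOPE("decodeDracoBuffer");
    // ... actual decoding work ...
}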

@javagl
Contributor

javagl commented Mar 29, 2021

Two points from your previous comments that I'd like to emphasize:

  • I think it should be possible to do at least some performance test runs with a "standalone" version of cesium-native. One reason is what you already mentioned: we don't know exactly what Unreal does internally, in terms of threading, request handling, or the things that are done when "preparing rendering resources". So we should be able to run these tests independently of Unreal, focusing on our own code. Another (IMHO equally important) reason: when we can start the test run in such a "standalone" executable, we can (more easily) use the Visual Studio profiler. (Remotely related: that's one of the reasons why I wondered whether we should really handle file:// URLs via Unreal - we'll need our own FileAssetAccessor at some point anyhow...)

  • The second point is that of "statistical data". While it can make sense to exclude "all the Unreal engine stuff" for certain test runs in order to zoom in on one of our own code paths, all our performance improvements might be eaten up by the parts that are actually costly: namely, the IO and "preparing renderer resources" (and these are handled solely by Unreal). Regardless of whether we can actually influence the performance of these parts, it should be possible to analyze them. Roughly: it would be cool to have a "Start trace" button in the UI, then click a "Fly through Melbourne" button to do a dedicated, predefined camera flight, and then hit "Stop trace", to (reproducibly) receive something like the trace.json that you posted and do some further analysis on it - for example, printing the min/max/avg of things like 1. request times, 2. IO times, 3. render-prepare times, etc. (a simple accumulator like the sketch below would cover that aggregation part).
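A minimal sketch of that min/max/avg aggregation over recorded durations (purely illustrative; not an existing API):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>

// One of these per category: request times, IO times, render-prepare times, ...
struct DurationStats {
    int64_t count = 0;
    int64_t totalUs = 0;
    int64_t minUs = std::numeric_limits<int64_t>::max();
    int64_t maxUs = 0;

    void add(int64_t durationUs) {
        ++count;
        totalUs += durationUs;
        minUs = std::min(minUs, durationUs);
        maxUs = std::max(maxUs, durationUs);
    }

    void print(const std::string& label) const {
        std::cout << label << ": min=" << (count ? minUs : 0) << "us"
                  << " max=" << maxUs << "us"
                  << " avg=" << (count ? totalUs / count : 0) << "us"
                  << " (" << count << " samples)\n";
    }
};

The stats could be fed either live from the instrumented code paths or by post-processing a recorded trace.json.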

@jtorresfabra
Contributor

Just stumbled across this one. Just wanted to say that I hit the stb_image slowness wall before, and the issue is even worse when running in wasm (#76 (comment)).

@kring kring added the performance and research labels Apr 27, 2021
@kring
Member Author

kring commented Jun 16, 2021

Quick update, mostly just as an excuse to share some nifty charts...

prepareInLoadThread (the Unreal, non-game-thread part of preparing a glTF for rendering) is taking a surprisingly long time for the Melbourne tileset. Here's a chart of the distribution of execution times for Cesium World Terrain:
[chart: distribution of prepareInLoadThread execution times for Cesium World Terrain]

It's produced in chrome://tracing by rubber-band selecting the CWT loading processes, selecting prepareInLoadThread, and then clicking the Duration radio button.

So it takes under 2ms on average, and the worst case is ~46ms.

Now here's the Melbourne tileset in the same run (I had Actors for both datasets):

[chart: distribution of prepareInLoadThread execution times for the Melbourne tileset]

Here the average is over 100ms and the longest tiles take close to a second. This is pretty surprising because the function does not include:

  • Any waiting for I/O or for threads to become available.
  • Draco decoding.
  • Image decoding.
  • Creating Unreal Engine UObjects or renderer resources.

It does include physics cooking, though, so that's my best guess about what's taking all the time.

Here's the complete trace that can be loaded in chrome://tracing:
tracer.zip

@baothientran
Contributor

baothientran commented Jun 16, 2021

@kring Are those numbers from the app running in game mode? If it's in editor mode, another possible source of the slowdown is swapping caused by garbage collection not running in the editor. In the past, after I forced gc.CollectEveryFrame 1, I saw a higher frame rate in the editor.

@kring
Member Author

kring commented Jun 16, 2021

That's a release build running in the editor. It's basically just the initial load of the two tilesets, on a 32GB system, so there shouldn't be any swapping. But I'll check it out to be sure.

@kring
Member Author

kring commented Jun 16, 2021

PhysX cooking averages 46ms per Melbourne tile. Compared to 1.3ms per CWT tile.
Computing tangents with MikkTSpace averages 32ms per Melbourne tile. Compared to 0.9ms per CWT tile.

Are the Melbourne tiles just that much more complex?

@kring
Member Author

kring commented Jun 17, 2021

On the MikkTSpace side, I think we can gain substantial performance just by... not doing it.

Tangents are required when we're using a normal map, because normal maps are expressed in tangent space. I thought they were required more generally, because there doesn't appear to be a way to exclude tangents entirely from a static mesh. But if we just set them to the zero vector (which is a lot faster than running MikkTSpace), I can't see any rendering differences at all. We should probably add an "Always Include Tangents" option, though, in case users are using a custom material that requires them.

A very nice side benefit of not generating tangents is that a model that has normals but doesn't have a normal map will no longer need to have vertices duplicated to avoid sharing them between triangles. Which means less memory usage and less GPU load. Models without normals will still need to have vertices duplicated, though, so that we can generate flat normals as required by the glTF spec.

@baothientran
Contributor

baothientran commented Jun 17, 2021

Do you think there's any chance Unreal generates the normals for us too? It would be interesting to see if the lighting changes at all if we remove the normals. If Unreal does generate normals automatically, we may get more performance by using indexed triangles instead of flattening them out as we currently do (sorry, didn't notice you already said that in the last paragraph).

@kring
Member Author

kring commented Jun 17, 2021

We can't remove normals, they're always in the static mesh. Same with tangents. But we can set the normals to some constant value or whatever instead of computing them properly. When we do that, the lighting falls apart, as you'd expect. AFAIK there's no way to have UE generate them for us. Even if there were, I'd be nervous that it would do it in the game thread and tank performance, like it does when physics meshes are missing (only in the editor, though...outside the editor there's just no collision at all).

It might be nice to have an option that says "when normals are missing, ignore the glTF spec's requirement of flat normals and just compute a normal per vertex from the weighted average of the faces that share that vertex." That should be much more performant and I suspect it'd be fine (or maybe even better) for 90% of cases.
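For reference, a minimal sketch of that per-vertex averaging over an indexed triangle list (illustrative only, not tied to the actual loader types; the unnormalized cross product provides the face-area weighting):

#include <cstdint>
#include <vector>
#include <glm/glm.hpp>

// Computes smooth per-vertex normals by summing unnormalized face normals
// (whose lengths are proportional to the triangle areas) and normalizing at the end.
std::vector<glm::vec3> computeSmoothNormals(
    const std::vector<glm::vec3>& positions,
    const std::vector<uint32_t>& indices) {
    std::vector<glm::vec3> normals(positions.size(), glm::vec3(0.0f));

    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        uint32_t i0 = indices[i];
        uint32_t i1 = indices[i + 1];
        uint32_t i2 = indices[i + 2];
        glm::vec3 faceNormal = glm::cross(
            positions[i1] - positions[i0],
            positions[i2] - positions[i0]);
        normals[i0] += faceNormal;
        normals[i1] += faceNormal;
        normals[i2] += faceNormal;
    }

    for (glm::vec3& normal : normals) {
        float length = glm::length(normal);
        normal = length > 0.0f ? normal / length : glm::vec3(0.0f, 0.0f, 1.0f);
    }

    return normals;
}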

@kring
Member Author

kring commented Jun 29, 2021

I wrote some new issues based on the work done here:
CesiumGS/cesium-unreal#509
CesiumGS/cesium-unreal#510
CesiumGS/cesium-unreal#511
CesiumGS/cesium-unreal#512

Performance is a never-ending task, but I think we can close this and write new and more specific performance-related issues as necessary.

@kring kring closed this as completed Jun 29, 2021