Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linux de-profile and msvc compilation fix #5967

Merged

Conversation

fluffyfreak
Copy link
Contributor

Two minor changes:

  • Disable the profiler in Debug builds on Linux because you shouldn't be profiling in Debug. The rationale for this is that Debug will behave very differently than Release builds and so the Profiling build is Release + Profiling tools.
  • The 2nd is a compilation fix for MSVC under Profiling or Debug builds where a runtime string was being passed into a Profiling macro which failed to build. I tried a number of things to utilise it but it's the only example in the codebase so switched it out.

- By default, uses sequentially-consistent std::atomic_compare_exchange_weak instead of hand-rolled asm
- USE_CHRONO is approx. 4.3x faster on AMD Ryzen 7 5800X than the hand-rolled rdtsc implementation (50ns per-call overhead vs. 216ns per-call overhead)
- Performance on aarch64 is unknown at this time
@sturnclaw
Copy link
Member

Continuing from #5943 (comment), I've pushed the fix which enables multithreaded profiler builds on non-x86 platforms to this branch as well.

Interestingly enough, when using std::atomics rather than the hand-rolled compare-and-swap implementation, the per-call overhead becomes quite a bit lower - up to 5x lower on my machine, in fact.

I'd appreciate a quick check to ensure that the CAS-lock path has a properly zero-initialized mLock variable on MSVC - I think it should be correctly initialized.

Copy link
Member

@sturnclaw sturnclaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on my end.

@fluffyfreak
Copy link
Contributor Author

On Windows with ProfilerZoneOutput=0 I now get a capture with information in it :D
However I still get a crash in HashTable::Reset when ProfilerZoneOutput=1

On RaspberryPi though it works with both and produces a good profile 👍 massive improvement there.

I'll look into the CAS mLock stuff a little later today

@fluffyfreak
Copy link
Contributor Author

I cannot see anything logically wrong with the CAS mLock stuff.

The Windows crash when using zone profiling appears to be some kind of double-delete, maybe, it's really hard to debug even with blocks of #pragma optimize("",off) surrounding chunks of code.

I think for now we take the win of what we've got and tackle that Windows issue in it's own PR

@fluffyfreak fluffyfreak merged commit 2668aa8 into pioneerspacesim:master Nov 15, 2024
4 checks passed
@fluffyfreak fluffyfreak deleted the Linux-de-profile-msvc-fix branch November 15, 2024 17:26
@sturnclaw
Copy link
Member

On RaspberryPi though it works with both and produces a good profile 👍 massive improvement there.

If you have time left today, feel free to zip up a trace (both .json and .html) from the running game on the Pi and send it my way (per the original comment that sparked all this).

@sturnclaw
Copy link
Member

I cannot see anything logically wrong with the CAS mLock stuff.

The Windows crash when using zone profiling appears to be some kind of double-delete, maybe, it's really hard to debug even with blocks of #pragma optimize("",off) surrounding chunks of code.

I think for now we take the win of what we've got and tackle that Windows issue in it's own PR

At risk of being "that guy", this is why I build with profiler enabled in the debug configuration - makes it much easier to find crash bugs.

@fluffyfreak
Copy link
Contributor Author

rpi5-profiler.zip
Zip of profiles, some might not be complete. The last one is sat on the pad on Earth, city and planet detail on Medium. With cockpit enabled at 1280x720

@sturnclaw
Copy link
Member

image

This tells me everything I need to know - we're 100% fully GPU-bound on RPi5. (Specifically, I can only assume that this is serializing on a map-write GL call to update the light buffer used in 95% of drawcalls in the previous frame.)

An individual physics tick takes ~3ms, which is surprisingly Not Bad for that hardware. Other than the GPU catch-up stall, rendering takes about 2.5ms to record command buffers, and another 3.75-4ms to upload those command buffers as OpenGL API calls. (Pending further insight, I'm going to say that means the software command list is Working As Intended.)

I'm actually not sure if there are any external GPU profilers for Raspberry Pi 5. We certainly don't have any in-process GPU profiling, and while it's possible to hack in to our existing profiler implementation, we might be better served by optionally using Tracy as our profiler backend. We'll certainly need to make a few alterations to our rendering architecture to allow embedding timing regions into the draw stream.

@sturnclaw
Copy link
Member

Moving further discussion back to #5700 where it probably should have been posted in the first place 😄

@fluffyfreak
Copy link
Contributor Author

Ah ha! This feels slightly hilarious, but the issue with the Zone profiling on Windows is a double delete and I could not figure out where it was coming from or why because it made no sense.

However this fixes it...

index 0be8f50dc..1ffee2dc3 100644
--- a/contrib/profiler/Profiler.cpp
+++ b/contrib/profiler/Profiler.cpp
@@ -1365,7 +1365,7 @@ namespace Profiler {
        };

        template< class Dumper >
-       void dumpThreads( Dumper dumper, const char *dir ) {
+       void dumpThreads( Dumper &dumper, const char *dir ) {
                PROFILE_SCOPED()
                Caller *accumulate = new Caller( "/Top Callers" ), *packer = new Caller( "/Thread Packer" );
                Buffer<Caller *> packedThreads;

I think it's because the ZoneDumper is being copied into dumpThreads instead of passed by reference.
Need to test this on Linux/RPi but it fixes it in MSVC

@sturnclaw
Copy link
Member

I'll give this a try this afternoon!

@fluffyfreak
Copy link
Contributor Author

Damn, I had to make some other tweaks to make this build on RPi and I suspect would be the same under x64 Linux builds

void dump(const char *dir) { PrintfDumper pd; dumpThreads( pd, dir ); }
void dumptrace(const char *dir) { TraceDumper td; dumpThreads( td, dir ); }
void dumpzones(const char *dir) { ZoneDumper zd; dumpThreads( zd, dir ); }
void dumphtml(const char *dir) { HTMLDumper hd; dumpThreads( hd, dir ); }

@fluffyfreak
Copy link
Contributor Author

Damn, disappointingly this causes a SIGSEGV which I'd have to look into later

@sturnclaw
Copy link
Member

This is a nasty one - the HashTable implementation in Profiler.cpp doesn't properly initialize itself before running a constructor which expects zero-initialized values. There's probably also some weirdness going on where MSVC actually copies the temporary object to pass it to the function, then destroys the initial temporary, leading to the double-delete (as the HashTable stored by ZoneDumper just contains a raw pointer). I can only assume GCC/Clang optimize out that copy, as I've never run into that crash or seen it reported before.

I'm pushing a fix for all of the above to #5970 as 827d42f.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants