-
Notifications
You must be signed in to change notification settings - Fork 78
Frame Pointers: Miracle Or Menace?
Look I know there's a ton of Umm Actually's about this stuff but I want to cover the Normal Case here as a baseline!
Ok so each thread of your native program has a Stack that stores the state each function needs. The stack is usually just one big array of bytes, often on the order of 8MB (although maybe lazily mapped), and generally natively understood by the operating system and CPU. When you call a function it logically pushes a "Frame" onto the stack to store that function's state. When you return from a function it logically pops that "Frame" from the stack.
Calling a function basically puts the caller on hold while the callee runs, and when we return control to it some of its state must be restored to the right registers. This often includes:
- The instruction pointer (where the function was in its execution)
- The stack pointer (where the top of its stackframe is on the stack)
- The frame pointer (where the bottom of its stackframe is on the stack)
- Miscellaneous "Callee-Saved Registers" (variables the caller is allowed to assume the callee won't change, a convention that can improve perf)
The absolutely easiest way for the callee to ensure this is to have a "prologue" that always runs at the start and an "epilogue" that always runs at the end. The prologue saves all those values to its stackframe, and the epilogue restores those values from its stackframe. The frame pointer is important to this process because the stack pointer may fluctuate as the function pushes/pops more state, making the offsets to things change constantly. Because the frame pointer never changes, you can just remember fixed offsets from everything.
If every function in a program used the Super Simple prologue/epilogue and preserved the frame pointer at all times, then popping an arbitrary stack frame would be essentially:
- get the frame pointer, which conveniently points at the start of the frame where all saved registers are pushed
- restore all the registers (instruction, stack, frame, callee-saved) based on their known fixed order
(The frame pointer here is in fact just the saved stack pointer, or at least a fixed offset from it.)
You can see how this would make unwinding, backtraces, and debuggers very simple, fast, and reliable. So of course we no longer do this!
The downside of a Fully Standard And Rigid function prologue and epilogue is that it adds a fixed cost to every single function call. And actually, interpreted as strictly as possible it would prevent us from ever inlining a function! And what do we get for it? Faster and simpler "unwinding, backtraces, and debuggers"? Those probably aren't on the hot path of the normal operation of a program. So if there are ways to make the "normal" operation of the program faster at the cost of slowing down or complicating those other things, that sounds worth it!
Here's some typical optimizations that run roughshod over Sensical Stacks:
-
Who even needs a frame pointer!? I can statically remember how big my stack frame is at every point in my execution, I can just compute all the offsets I care about from the stack pointer! Wow that frees up a whole general purpose register!!
-
If the last thing I do is call a function and immediately return its result, why does my stack frame even need to exist anymore? Can't I just replace it with the new function? Tail-call elimination/optimization!!
-
If the platform has a link register (arm, mips) then instead of saving the instruction pointer to memory, we can save it to the link register. Of course this just kicks the can down the road, because now we need to save the link register! Except "leaf" functions which don't call any other functions can avoid saving the link register! Hooray! If they don't need any other on-stack state they can actually have a 0-sized frame and don't need to restore the stack pointer either!
-
If the function doesn't modify a callee-saved register, then it doesn't need to save it, it's already restored!
-
Actually lets get Really Smart and remember that on this path of the function
rax
storescaller_rax + 5
, so we can just restore it by subtracting 5!
All of these tricks are absolutely real things compilers have done at one time or another to make programs faster, and why shouldn't they? Well, because it makes unwinding, backtraces, and debugging WAY SLOWER AND MORE COMPLICATED. Like problematically slow and complicated!
All of these tricks needed to be recorded in "unwinding tables" so they can be compensated for. And you need to remember those adjustments for every address in the binary -- yes every byte in the binary has its own sub-program. And the compiler needs to remember to correctly transform those sub-programs while optimizing the main program. And the unwinder now needs to lookup, parse, and interpret this really fiddly and complex unwinding VM.
But hey, it does indeed make programs run a few % faster, especially on x86 (actual 32-bit) where there's very few general purpose registers!
More seriously, there's four major arguments for frame pointers:
- It would make unwind tables smaller, making binaries smaller (since unwind tables often can't be omitted from production programs unless they are built with
panic=abort
). - It would make compilers/devtools more reliable, as unwind tables can be buggy/missing (apple compact unwind info intentionally emits incorrect unwind tables for prologues/epilogues because the program itself never needs to unwind from those locations, and functions otherwise usually only need one entry for their entire body).
- It makes unwinding, and especially backtraces way faster.
- The runtime cost of frame pointers on modern 64-bit systems with way more registers and more optimized cores is now negligible!
In fact, in the last few years Firefox made a big push to enable frame pointers everywhere it could... because the performance team wanted it! You see, there is one situation where backtraces are absolutely on the hot path: sampling profilers! Most major profilers are based on sampling, which is basically constantly taking backtraces of the program being profiled to determine where it's spending its time (on average).
This is supposed to be a lightweight approach that doesn't interfere too much with the program's performance characteristics, except... it's extremely heavyweight and slow, because unwinding is that pessimized. Also it can produce wrong results in all the places the unwind tables are wrong, because the profiler needs to take backtraces even in places where the actual program would "never" unwind! I believe there was infamously a profiler that managed to significantly optimize its execution by... just copying the entire stack every time it took a sample so it could do the unwinding offline! Yikes!
But with frame pointers enabled, the kind of minimal backtraces that profilers need to do (essentially computing the instruction pointer for every function on the stack), it's just reliably chasing a small linked list stored entirely on the stack. This makes it the profiler more reliable and responsive, and also interferes with the measurement a lot less. If you make all the profiling tools better, the couple % of perf you sacrifice will probably get paid off very quickly by your engineers finding big performance wins! Your users might even do the profiling for you!
I don't have a good segue to bring this up but FrameHop is also a really cool Rust-based backtracer-for-profilers that does a bunch of fancy stuff to basically JIT unwind tables to optimize the execution of samply, the Rust-based sampling profiler. Possibly this makes frame pointers less important.
TODO: flesh this out
- We still want to not-push callee-saved registers we don't need to, so "full" unwinding still needs tables to recover those
- On ARM64, leaf functions can still have inconsistent frame pointers, resulting in skipping over the second frame in the stack (but resuming normally on the third)
- Similar frame skipping can happen if you hit the right part of a function prologue/epilogue where the frame pointer hasn't been swapped yet
For Rust, the flag of interest is -Cforce-frame-pointers=yes. The equivalent GCC/Clang flag is -fno-omit-frame-pointer
. It's just that simple!
I believe this should just be enabled by default on Apple ARM64 platforms because it's required by the platform ABI, except in leaf functions.
Unfortunately, enabling the flag on Windows isn't helpful (or harmful) because Rust matches the behaviour of Microsoft's toolchain, and Microsoft's behaviour for frame pointers is useless for unwinding. Essentially they defined frame pointers to point "somewhere in the middle of the stackframe" so that it could be used to make binaries smaller, because it would make the average (signed) offset to locals smaller (which has weird implications since many cpu instructions can encode small immediates directly).
I don't know if there's a world where Microsoft fixes this, or if ARM64 Windows fixes this, but, I wish it was fixed!
Also note that frame pointers working reasonably in some sense requires every part of your program, including native system DLLs, to cooperate! That said even if you need to go to the unwind tables to confirm each function is using frame pointers, that's still more reliable and efficient than more complicated unwinding instructions!