Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CallstackIsntr: Fix callback types #1447

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from
Open

Conversation

AndrewFasano
Copy link
Contributor

@AndrewFasano AndrewFasano commented Feb 19, 2024

#1445 incorrectly change the callback types without changing how they were registered. Thanks to @LauraLMann and @MarkMankins for pointing this out! I'm slightly terrified that this sort of mistake doesn't cause a compile error (and that all the CI tests still passed)!

I reverted that PR on dev. This PR now includes both those changes + the fix so we can discuss the issue @MarkMankins has found when the intended change is applied.

@AndrewFasano AndrewFasano enabled auto-merge (rebase) February 19, 2024 20:51
@MarkMankins
Copy link
Collaborator

MarkMankins commented Feb 19, 2024

This change combined with #1445 introduces an inconsistency in the output of callstack_instr. Running the same recording multiple times can result in different output from callstack_instr...here's an example of what I'm seeing:

$ diff win2k_ssbranch_string_matches.txt win2k_ssbranch_string_matches_expected.txt                                                       1,5c1,5
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012cfa (asid=0x0e32e000)  1
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d0b (asid=0x0e32e000)  1
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc394 (asid=0x0e32e000)  2
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d72e 80402408 (asid=0x00000000)  2
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d947 80402408 (asid=0x00000000)  2
---
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012cfa (asid=0x0e32e000)  1
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d0b (asid=0x0e32e000)  1
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc394 (asid=0x0e32e000)  2
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d72e 80402408 (asid=0x00000000)  2
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d947 80402408 (asid=0x00000000)  2

@AndrewFasano
Copy link
Contributor Author

Oh no, I was expecting a difference with output from callstack_instr before/after these changes, but you're saying it's just producing varying output just running repeatedly with these changes?

@MarkMankins
Copy link
Collaborator

Oh no, I was expecting a difference with output from callstack_instr before/after these changes, but you're saying it's just producing varying output just running repeatedly with these changes?

Correct - different consistent output would not have surprised me - but I'm seeing output that varies between different runs of panda replaying the same recording file.

@AndrewFasano
Copy link
Contributor Author

Yikes, that's no good. I'm going to revert #1445 on dev, and rebase all the commits into this branch/PR to keep this self contained.

@lacraig2 you're the SBE/EBE callback expert, do you have any idea why this might occur? Do EBE callbacks get triggered if a block's execution is interrupted?

AndrewFasano pushed a commit that referenced this pull request Feb 19, 2024
Commit c575951 without #1447
was incomplete. And with #1447 it's broken.
@lacraig2
Copy link
Member

@AndrewFasano EBE callbacks do not occur when an exit is requested. SBE/EBE are injected into the TCG stream and only occur when execution reaches that point.

I am skeptical that this is the reason that we're experiencing an issue here. The recordings I've taken have tens of millions of instructions and tend to have tens of exits.

I've done some analysis:

#!/usr/bin/env python3
from sys import argv
from pandare import Panda

# Single arg of arch, defaults to i386
arch = "arm" if len(argv) <= 1 else argv[1]
panda = Panda(generic=arch)
from os.path import exists

recording_name = "rec"+arch
if not exists(f"{recording_name}-rr-snp"):
    print("taking recording")
    @panda.queue_blocking
    def do_stuff():
        panda.revert_sync("root")
        panda.run_monitor_cmd(f"begin_record {recording_name}")
        print(panda.run_serial_cmd("cat /etc/passwd && sleep 2 && cat /etc/passwd && ls -la"))
        panda.run_monitor_cmd("end_record")
        panda.stop_run()
    panda.run()
else:
    print("recording exists. not remaking recording")


bbe = 0
@panda.cb_before_block_exec
def bbed(cpu, tb):
    global bbe
    bbe += 1

abe = 0
num_exit = 0

@panda.cb_after_block_exec
def abed(cpu, tb, ec):
    global num_exit,num_exit_tb_exit
    if ec > 1:
        num_exit += 1
        
    global abe
    abe += 1

sbe = 0
@panda.cb_start_block_exec
def sbed(cpu, tb):
    global sbe
    sbe += 1

ebe = 0
@panda.cb_end_block_exec
def ebed(cpu, tb):
    global ebe
    ebe += 1

panda.enable_tb_chaining()
panda.run_replay(recording_name)

print(f"{bbe=}, {sbe=}")
print(f"Number missing: {bbe-sbe=}")

print(f"{abe=}, {ebe=}")
print(f"Number missing: {abe-ebe=}")

print(f"Starts vs ends: {sbe-ebe=}, {bbe-abe=}")



print(f"{num_exit=} ")

Which produces results like this:

bbe=3128690, sbe=3128678
Number missing: bbe-sbe=12
abe=3126280, ebe=3126280
Number missing: abe-ebe=0
Starts vs ends: sbe-ebe=2398, bbe-abe=2410
num_exit=12

To me this indicates that the block is checking for the exit status ahead of start_block_exec. The fact that this only occurs 12 times in the entire recording makes me skeptical that this is indeed the issue.

This does appear to be something you can check because even if the block is exited early our after_block_exec callback occurs. This means you could simulate this particular condition by enabling after_block_exec as well and only examining the output if the exit condition is greater than 1.

@AndrewFasano
Copy link
Contributor Author

Thanks Luke, that's helpful info. I'm glad the EBE callback isn't called when a block is interrupted, that's a nice improvement over ABE!

I can try building some more tests for this PR later, but I'm not sure how this PR could (in itself) cause the non-deterministic behavior unless something it's built on is more fundamentally broken 😨

The old logic would miss many calls if tb_chaining was enabled (which is
is by default). Since this plugin did not disable tb_chaining, many calls
would be missed if a user didn't disable chaining or load another plugin
that disabled chaining.

This commit updates the plugin to use start_block_exec and end_block_exec
which work even with tb_chaining enabled.
@AndrewFasano AndrewFasano force-pushed the callstack_instr_fix2 branch from ac4e71f to d152c22 Compare March 5, 2024 22:45
@MarkMankins
Copy link
Collaborator

I tested with the latest code and I'm still seeing issues where callstack_instr outputs different results for the same replay.

@AndrewFasano
Copy link
Contributor Author

I haven't forgotten about this, just haven't had time to track down the issue. Things are working correctly on the main dev branch, right?

@MarkMankins
Copy link
Collaborator

Yes, no issues on dev.

@MarkMankins
Copy link
Collaborator

Oh - sorry - didn't realize the recent push was just a rebase - I thought you made some changes - ignore me. :)

@lacraig2
Copy link
Member

lacraig2 commented Mar 6, 2024

I think I know what is going wrong here.

It looks like our implementation of start_block_exec runs prior to the first instruction, but not prior to the initial TCG block elements.

In this case those initial TCG checks mean that if an exit event has occurred we would generate a before_block_exec, but not a start_block_exec.

These are fairly rare, but over the course of millions of blocks it will occur.

I had started testing out a change to start_block_exec that adds functionality earlier in the block, but I ran into some issues with LLVM generation failing. I'll likely follow up with a PR.

@AndrewFasano
Copy link
Contributor Author

@lacraig2 so if that's the issue it would mean the SBE callback is sometimes (non-deterministically) triggered once when it should be triggered twice? Are we going to have to add an exitCode to the EBE callback to indicate if a block was ended early and then just keep the existing callstack instr logic for filtering out double calls.

@lacraig2
Copy link
Member

I think I've found the issue. It seems like SBE should check panda_exit_loop.

I've implemented this as a separate PR #1471, but I have cherry picked it onto this branch so that @MarkMankins can check it with his tests. I have already confirmed that this fixes the issues for the file you sent me.

Once we confirm that this works I'd like to revert this commit and merge #1471 and then this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants