Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some debug tests like are UnavailableHaltedTest broken #520

Closed
aap-sc opened this issue Nov 8, 2023 · 4 comments
Closed

Some debug tests like are UnavailableHaltedTest broken #520

aap-sc opened this issue Nov 8, 2023 · 4 comments

Comments

@aap-sc
Copy link
Collaborator

aap-sc commented Nov 8, 2023

@timsifive , @en-sc , @AnastasiyaChernikova FYI.

The effects of the issue can be observed in riscv-collab/riscv-openocd#928 (though this commit has nothing to do with the issue, the root cause seems to be in the test suite itself).

Basically, If I take current riscv-openocd TOT most Unavailable... tests begin to fail in my environment. I've debugged the issue for a while now and I think I understand the root cause.

The problem is with def command(self, cmd) function from debug/testlib.py. What this function does is it tries to obtain results of OpenOCD command by parsing OpenOCD log instead of fetching the result directly (more on that later). Moreover, instead of sending only one command - we actually send 2 "commands". One real command and another is an artificial "command-like" comment to enforce OpenOCD to print prompt-like construct. The issue is that this approach does not always work.

Here are some examples (taken from log files after the failure).

Example 1

Debug: 5781 4300 riscv.c:3097 riscv_openocd_poll(): [riscv.cpu] should_remain_halted=0, should_resume=0
Debug: 5782 4300 target.c:3034 handle_target(): [riscv.cpu] target_poll() -> 0, next attempt in 100ms
targets
^@Debug: 5783 4313 command.c:152 script_debug(): command - targets
    TargetName         Type       Endian TapName            State       
--  ------------------ ---------- ------ ------------------ ------------
 0* riscv.cpu          riscv      little riscv.cpu          halted

> Debug: 5784 4399 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu] Polling all harts.
Debug: 5785 4399 riscv.c:2864 riscv_poll_hart(): [riscv.cpu] polling, target->state=2
Debug: 5786 4399 riscv-013.c:393 scan(): 40b r 00000000 @11 -> + 00000000 @11; 0i
Debug: 5787 4399 riscv-013.c:403 scan(): dmstatus=0 { version=none, ...
Debug: 5788 4399 riscv-013.c:393 scan(): 40b - 00000000 @11 -> + 00430382 @11; 0i
Debug: 5789 4399 riscv-013.c:403 scan(): dmstatus=0 { version=none, ...}
Debug: 5790 4399 riscv.c:1664 riscv_flush_registers(): [riscv.cpu] Flushing register cache
Debug: 5791 4399 riscv.c:1682 riscv_flush_registers(): [riscv.cpu] Flush of register cache completed
Debug: 5792 4399 riscv.c:3097 riscv_openocd_poll(): [riscv.cpu] should_remain_halted=0, should_resume=0
Debug: 5793 4399 target.c:3034 handle_target(): [riscv.cpu] target_poll() -> 0, next attempt in 100ms
# 7x
^@
> Debug: 5794 4499 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu] Polling all harts.
Debug: 5795 4499 riscv.c:2864 riscv_poll_hart(): [riscv.cpu] polling, target->state=2

Prompt for # 7x never matched.

Example 2

Another, more hilarious example is the following:

> targets
^@Debug: 8476 2452 command.c:152 script_debug(): command - targets
Debug: 8477 2452 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu0] Polling all harts.
Debug: 8478 2452 riscv.c:2864 riscv_poll_hart(): [riscv.cpu0] polling, target->state=2
...
Info : 8493 2452 riscv.c:2974 riscv_poll_hart(): [riscv.cpu1] became unavailable.
[riscv.cpu1] became unavailable.
...
Debug: 8516 2453 target.c:3034 handle_target(): [riscv.cpu1] target_poll() -> 0, next attempt in 100ms
    TargetName         Type       Endian TapName            State       
--  ------------------ ---------- ------ ------------------ ------------
 0  riscv.cpu0         riscv      little riscv.cpu          halted
 1* riscv.cpu1         riscv      little riscv.cpu          unavailable

> # 88x
^@
> Debug: 8517 2552 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu0] Polling all harts

Here the prompt is matched. But the output is not parsed correctly, leading to the following error:

Traceback (most recent call last):
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/testlib.py", line 1333, in run
    result = self.test()    # pylint: disable=no-member
             ^^^^^^^^^^^
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/gdbserver.py", ...
    self.server.set_available(
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/testlib.py", line 515, in set_available
    if target["State"] == "unavailable":
       ~~~~~~^^^^^^^^^
KeyError: 'State'

This happens because targets state is parsed like this:

[
   {'TargetName': 'riscv.cpu0', 'Type': 'riscv', 'Endian': 'little', 'TapName': 'riscv.cpu', 'State': 'halted'},
   {'TargetName': 'riscv.cpu1', 'Type': 'riscv', 'Endian': 'little', 'TapName': 'riscv.cpu', 'State': 'halted'},
   {'TargetName': 'Info', 'Type': ':', 'Endian': '5749', 'TapName': '3031', 'State': 'riscv.c:2974'}, 
   {'TargetName': '\x08\x08[riscv.cpu1]', 'Type': 'became', 'Endian': 'unavailable.'}
]

Note: these are but a few examples. The number of situations resulting in test failure is quite large. Though it all comes down to incorrect parsing of the result.

TLDR

  • Looks like internal debug logging and processing of user commands have their respected IO facilities desynchronized (most likely they use separate buffering)
  • Obtaining results of command via parsing of log files between 2 "commands" is a bad idea (lot's of things can happen between the first and second "commands") as can be observed in Example 2.

Proposed solution:

Instead of parsing OpenOCD log files (and connecting to a pipe for an input) I suggest to communicate with OpenOCD via TCL-rpc server. TCL-RPC server is almost the same that ordinary telnet server, but it is more suitable for machine-machine interaction (we have a special watermark to indicate message end). We don't need to parse OpenOCD logs to get instruction results - we can fetch the result string directly.

P.S: I'll try to propose a patch to address the issue. This may take some time, though. So if anyone has an urgent desire to fix this - don't be shy ).

@aap-sc
Copy link
Collaborator Author

aap-sc commented Nov 8, 2023

Oh, and by the way. I think the reason I can reproduce it reliably is that we always run these tests with "-d3" passed to OpenOCD. So the amount of debug logs is increased. This may be the reason why people could have difficulties with reproducing this.

timsifive added a commit that referenced this issue Nov 8, 2023
They aren't always reliable. See #520.
@timsifive
Copy link
Collaborator

FWIW, even with -d3 this doesn't reproduce for me locally.

@aap-sc
Copy link
Collaborator Author

aap-sc commented Nov 8, 2023

The patch with the suggested fixes: #522

aap-sc added a commit to aap-sc/riscv-tests that referenced this issue Nov 9, 2023
aap-sc added a commit to aap-sc/riscv-tests that referenced this issue Nov 9, 2023
@aap-sc
Copy link
Collaborator Author

aap-sc commented Nov 10, 2023

Looks like the patch fixes this issue, so closing the ticket

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants