Some debug tests like are UnavailableHaltedTest broken #520

aap-sc · 2023-11-08T01:06:15Z

@timsifive , @en-sc , @AnastasiyaChernikova FYI.

The effects of the issue can be observed in riscv-collab/riscv-openocd#928 (though this commit has nothing to do with the issue, the root cause seems to be in the test suite itself).

Basically, If I take current riscv-openocd TOT most Unavailable... tests begin to fail in my environment. I've debugged the issue for a while now and I think I understand the root cause.

The problem is with def command(self, cmd) function from debug/testlib.py. What this function does is it tries to obtain results of OpenOCD command by parsing OpenOCD log instead of fetching the result directly (more on that later). Moreover, instead of sending only one command - we actually send 2 "commands". One real command and another is an artificial "command-like" comment to enforce OpenOCD to print prompt-like construct. The issue is that this approach does not always work.

Here are some examples (taken from log files after the failure).

Example 1

Debug: 5781 4300 riscv.c:3097 riscv_openocd_poll(): [riscv.cpu] should_remain_halted=0, should_resume=0
Debug: 5782 4300 target.c:3034 handle_target(): [riscv.cpu] target_poll() -> 0, next attempt in 100ms
targets
^@Debug: 5783 4313 command.c:152 script_debug(): command - targets
    TargetName         Type       Endian TapName            State       
--  ------------------ ---------- ------ ------------------ ------------
 0* riscv.cpu          riscv      little riscv.cpu          halted

> Debug: 5784 4399 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu] Polling all harts.
Debug: 5785 4399 riscv.c:2864 riscv_poll_hart(): [riscv.cpu] polling, target->state=2
Debug: 5786 4399 riscv-013.c:393 scan(): 40b r 00000000 @11 -> + 00000000 @11; 0i
Debug: 5787 4399 riscv-013.c:403 scan(): dmstatus=0 { version=none, ...
Debug: 5788 4399 riscv-013.c:393 scan(): 40b - 00000000 @11 -> + 00430382 @11; 0i
Debug: 5789 4399 riscv-013.c:403 scan(): dmstatus=0 { version=none, ...}
Debug: 5790 4399 riscv.c:1664 riscv_flush_registers(): [riscv.cpu] Flushing register cache
Debug: 5791 4399 riscv.c:1682 riscv_flush_registers(): [riscv.cpu] Flush of register cache completed
Debug: 5792 4399 riscv.c:3097 riscv_openocd_poll(): [riscv.cpu] should_remain_halted=0, should_resume=0
Debug: 5793 4399 target.c:3034 handle_target(): [riscv.cpu] target_poll() -> 0, next attempt in 100ms
# 7x
^@
> Debug: 5794 4499 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu] Polling all harts.
Debug: 5795 4499 riscv.c:2864 riscv_poll_hart(): [riscv.cpu] polling, target->state=2

Prompt for # 7x never matched.

Example 2

Another, more hilarious example is the following:

> targets
^@Debug: 8476 2452 command.c:152 script_debug(): command - targets
Debug: 8477 2452 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu0] Polling all harts.
Debug: 8478 2452 riscv.c:2864 riscv_poll_hart(): [riscv.cpu0] polling, target->state=2
...
Info : 8493 2452 riscv.c:2974 riscv_poll_hart(): [riscv.cpu1] became unavailable.
[riscv.cpu1] became unavailable.
...
Debug: 8516 2453 target.c:3034 handle_target(): [riscv.cpu1] target_poll() -> 0, next attempt in 100ms
    TargetName         Type       Endian TapName            State       
--  ------------------ ---------- ------ ------------------ ------------
 0  riscv.cpu0         riscv      little riscv.cpu          halted
 1* riscv.cpu1         riscv      little riscv.cpu          unavailable

> # 88x
^@
> Debug: 8517 2552 riscv.c:3041 riscv_openocd_poll(): [riscv.cpu0] Polling all harts

Here the prompt is matched. But the output is not parsed correctly, leading to the following error:

Traceback (most recent call last):
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/testlib.py", line 1333, in run
    result = self.test()    # pylint: disable=no-member
             ^^^^^^^^^^^
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/gdbserver.py", ...
    self.server.set_available(
  File "/home/ecco/projects/sc_openocd/build/Debug/RISCVTests/debug/testlib.py", line 515, in set_available
    if target["State"] == "unavailable":
       ~~~~~~^^^^^^^^^
KeyError: 'State'

This happens because targets state is parsed like this:

[
   {'TargetName': 'riscv.cpu0', 'Type': 'riscv', 'Endian': 'little', 'TapName': 'riscv.cpu', 'State': 'halted'},
   {'TargetName': 'riscv.cpu1', 'Type': 'riscv', 'Endian': 'little', 'TapName': 'riscv.cpu', 'State': 'halted'},
   {'TargetName': 'Info', 'Type': ':', 'Endian': '5749', 'TapName': '3031', 'State': 'riscv.c:2974'}, 
   {'TargetName': '\x08\x08[riscv.cpu1]', 'Type': 'became', 'Endian': 'unavailable.'}
]

Note: these are but a few examples. The number of situations resulting in test failure is quite large. Though it all comes down to incorrect parsing of the result.

TLDR

Looks like internal debug logging and processing of user commands have their respected IO facilities desynchronized (most likely they use separate buffering)
Obtaining results of command via parsing of log files between 2 "commands" is a bad idea (lot's of things can happen between the first and second "commands") as can be observed in Example 2.

Proposed solution:

Instead of parsing OpenOCD log files (and connecting to a pipe for an input) I suggest to communicate with OpenOCD via TCL-rpc server. TCL-RPC server is almost the same that ordinary telnet server, but it is more suitable for machine-machine interaction (we have a special watermark to indicate message end). We don't need to parse OpenOCD logs to get instruction results - we can fetch the result string directly.

P.S: I'll try to propose a patch to address the issue. This may take some time, though. So if anyone has an urgent desire to fix this - don't be shy ).

The text was updated successfully, but these errors were encountered:

aap-sc · 2023-11-08T09:29:46Z

Oh, and by the way. I think the reason I can reproduce it reliably is that we always run these tests with "-d3" passed to OpenOCD. So the amount of debug logs is increased. This may be the reason why people could have difficulties with reproducing this.

They aren't always reliable. See #520.

timsifive · 2023-11-08T17:48:28Z

FWIW, even with -d3 this doesn't reproduce for me locally.

…rsing log file Quick and dirty fix for riscv-software-src#520

aap-sc · 2023-11-08T17:56:57Z

The patch with the suggested fixes: #522

…rsing log file Quick and dirty fix for riscv-software-src#520

aap-sc · 2023-11-10T18:51:51Z

Looks like the patch fixes this issue, so closing the ticket

aap-sc mentioned this issue Nov 8, 2023

target/riscv: cache requests to trigger configuration riscv-collab/riscv-openocd#928

Merged

timsifive added a commit that referenced this issue Nov 8, 2023

debug: Disable Unavailable tests.

9e114cd

They aren't always reliable. See #520.

timsifive mentioned this issue Nov 8, 2023

debug: Disable Unavailable tests. #521

Closed

aap-sc added a commit to aap-sc/riscv-tests that referenced this issue Nov 8, 2023

debug: use TCL-RPC to fetch results of OpenOCD commands instead of pa…

3e381ec

…rsing log file Quick and dirty fix for riscv-software-src#520

aap-sc mentioned this issue Nov 8, 2023

debug: use TCL-RPC to fetch results of OpenOCD commands instead of parsing log file #522

Merged

aap-sc added a commit to aap-sc/riscv-tests that referenced this issue Nov 9, 2023

debug: use TCL-RPC to fetch results of OpenOCD commands instead of pa…

17297a9

…rsing log file Quick and dirty fix for riscv-software-src#520

aap-sc added a commit to aap-sc/riscv-tests that referenced this issue Nov 9, 2023

debug: use TCL-RPC to fetch results of OpenOCD commands instead of pa…

a4d9f97

…rsing log file Quick and dirty fix for riscv-software-src#520

aap-sc closed this as completed Nov 10, 2023

aap-sc mentioned this issue Apr 21, 2024

Random/sporadic failures running test suite interactively against Spike riscv-collab/riscv-openocd#1049

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some debug tests like are UnavailableHaltedTest broken #520

Some debug tests like are UnavailableHaltedTest broken #520

aap-sc commented Nov 8, 2023 •

edited

Loading

aap-sc commented Nov 8, 2023

timsifive commented Nov 8, 2023

aap-sc commented Nov 8, 2023

aap-sc commented Nov 10, 2023

Some debug tests like are UnavailableHaltedTest broken #520

Some debug tests like are UnavailableHaltedTest broken #520

Comments

aap-sc commented Nov 8, 2023 • edited Loading

aap-sc commented Nov 8, 2023

timsifive commented Nov 8, 2023

aap-sc commented Nov 8, 2023

aap-sc commented Nov 10, 2023

aap-sc commented Nov 8, 2023 •

edited

Loading