test_communication tests take a long time and provide unclear feedback on error #404

rotu · 2020-03-10T16:38:38Z

Bug report

Required Info:

Operating System:
- All
Installation type:
- All
Version or commit hash:
- 68c495a
DDS implementation:
- All
Client library (if applicable):
- All

Steps to reproduce issue

colcon test --packages-select test_communication

Expected behavior

If failures exist, each test case terminates in a short (<1s) time and reports a relevant failure message (something like "10 messages were sent but 0 were received").

Actual behavior

On failure, the test takes a long time (10s) and the message only reports "timed out waiting for ... to finish". This makes it sound like the receiving process deadlocked. Additionally, the assertion uses a confusing string representation of the launch action object, where the subscriber executable name and arguments would be more appropriate.

https://ci.ros2.org/user/rotu/my-views/view/Extra%20RMW/job/nightly_linux-aarch64_extra_rmw_release/670/testReport/test_communication/TestPublisherSubscriber/test_subscriber_terminates_in_a_finite_amount_of_time_Arrays_/

Traceback (most recent call last):
  File "/home/jenkins-agent/workspace/nightly_linux-aarch64_extra_rmw_release/ws/build/test_communication/test_publisher_subscriber__rclpy__rclcpp__rmw_fastrtps_dynamic_cpp__rmw_cyclonedds_cpp_Release.py", line 66, in test_subscriber_terminates_in_a_finite_amount_of_time
    proc_info.assertWaitForShutdown(process=subscriber_process, timeout=10)
  File "/home/jenkins-agent/workspace/nightly_linux-aarch64_extra_rmw_release/ws/install/launch_testing/lib/python3.6/site-packages/launch_testing/proc_info_handler.py", line 144, in assertWaitForShutdown
    assert success, "Timed out waiting for process '{}' to finish".format(process)
AssertionError: Timed out waiting for process '<launch.actions.execute_process.ExecuteProcess object at 0xffff7e693e80>' to finish
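
As an illustration of the more readable message suggested above, here is a minimal sketch that formats a timeout message from the command line instead of the object repr. It is not the launch_testing implementation: `describe_command`, `timeout_message`, and the executable name in the usage note are hypothetical, and it assumes the command is available as an argv-style list of strings.

```python
import shlex


def describe_command(cmd):
    """Render an argv-style list as one shell-quoted string (hypothetical helper)."""
    return ' '.join(shlex.quote(str(part)) for part in cmd)


def timeout_message(cmd, timeout):
    """Build a failure message that names the executable and its arguments."""
    return "Timed out after {}s waiting for process '{}' to finish".format(
        timeout, describe_command(cmd))
```

For example, `timeout_message(['some_subscriber', 'Arrays'], 10)` would name `some_subscriber Arrays` in the failure text rather than `<launch.actions.execute_process.ExecuteProcess object at 0x...>`.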

hidmic · 2020-03-20T17:06:50Z

I fully agree that assertion messages can (and should) be made clearer, simply by not obscuring the subscriber output, which is where the actual test is. But I don't think:

If failures exist, each test case terminates in a short (<1s) time and reports a relevant failure message (something like "10 messages were sent but 0 were received").

is desirable, let alone achievable (in the short to mid-term), particularly the part about making tests faster.

Cross-vendor pub/sub tests consist of a publisher process sending a fixed sequence of known messages in cycles and a subscriber process asserting on those messages. In that context, the launch testing infrastructure plays the same role as ROS 1 rostest, and nothing more: namely, a process launcher with timeout and return code checks, plus test output aggregation (which IMHO launch_testing is particularly bad at).

Asserting on the number of sent vs. received messages implies that a synchronization mechanism between the launcher, publisher, and subscriber processes is in place. Otherwise, latencies in process creation, process scheduling, DDS participant creation, and DDS participant discovery, to name a few, will render the test flaky and unpredictable. No such synchronization mechanism currently exists in the framework. We can later discuss whether one should be introduced, but it's certainly unlikely to land in the short term.

We could achieve something like what you describe by simplifying the test down to whether a message was received or not, but that would then be a different test.

rotu · 2020-03-20T18:04:21Z

There is a synchronization mechanism implied; all messages must be received within a 10s window from launch for the test to be considered passing. That window seems unreasonably long and could be made much shorter. Considering all test cases I can see pass in < 2 seconds, it seems that window can be reduced.
Yes, it's totally reasonable to have some communication from the listener back to the test. Right now, that communication is an exit status. It could be a file, shared memory, a stream, or any sort of IPC channel.

hidmic · 2020-03-21T17:00:07Z

There is a synchronization mechanism implied; all messages must be received within a 10s window from launch for the test to be considered passing.

The start of process execution is synchronized, but that's far from enough. That window is a worst-case scenario, though AFAIK it's true that no exhaustive search has been conducted to find a lower bound for it.

That window seems unreasonably long and could be made much shorter. Considering all test cases I can see pass in < 2 seconds, it seems that window can be reduced.

Which tests? With which RMW implementation? On which platform? Under what CPU load?

Yes, it's totally reasonable to have some communication from the listener back to the test.

The listener is the test. So I'd refrain from rolling out an IPC mechanism just to get the launcher on this test to do the assertion.

Thinking about this again, we could explore having more timeouts in the listener, e.g. a timeout to first message arrival and a timeout to final message arrival, instead of a single, global one, though I wouldn't dare to guess how small these can be made while still getting the test to pass for all (RMW, OS) combinations. You're more than welcome to contribute an attempt.
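
A rough sketch of that two-timeout idea, under stated assumptions: it is not the existing test_communication subscriber, messages are modeled as items arriving on a `queue.Queue`, and `wait_for_messages` plus its parameters are hypothetical names. Suitable timeout values are exactly the open question raised above, so none are suggested here.

```python
import queue
import time


def wait_for_messages(inbox, expected_count, first_timeout, total_timeout):
    """Drain `inbox` under two deadlines and return (ok, report).

    first_timeout bounds the wait for the first message (e.g. discovery latency);
    total_timeout bounds the wait for the last one. Both are measured from the start.
    """
    start = time.monotonic()
    received = 0
    while received < expected_count:
        # Until something arrives, the first-message deadline applies.
        limit = first_timeout if received == 0 else total_timeout
        remaining = start + limit - time.monotonic()
        if remaining <= 0:
            break
        try:
            inbox.get(timeout=remaining)
        except queue.Empty:
            break
        received += 1
    ok = received == expected_count
    report = '{} messages were expected but {} were received'.format(
        expected_count, received)
    return ok, report
```

A listener built this way could print `report` and exit with a nonzero status when `ok` is false, so a failure with no messages at all is reported after `first_timeout` rather than after the full window, and no extra channel back to the launcher is needed.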

claireyywang added the bug Something isn't working label Mar 19, 2020

claireyywang assigned hidmic Mar 19, 2020

rotu mentioned this issue Mar 20, 2020

Prevent subscriber tests from hanging forever if no work #401

Closed

hidmic mentioned this issue Apr 16, 2021

Fix Topic Info Test with "Infinite" printing ros2/ros2cli#616

Merged
