Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] Fix Segfault in ISMC #36

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

amarburg
Copy link
Collaborator

Changes Made

While doing to "joystick teleop" demo, I had frequent segfaults from controller_manager, unpredictably after 10-60 seconds. After building with ReleaseWithDebInfo and running in GDB, segfault was at line 262:

auto *current_system_state = system_state_.readFromRT();

....

reset_twist_msg(*current_reference);

where current_system_state is a shared_ptr<Twist>. Handling a pointer to a shared_ptr does not increment the refcount, so I think this resulted in the shared_ptr being reaped while this function is still running.

This MR replaces the two instances of this template with:

auto current_system_state(*system_state_.readFromRT());

....

reset_twist_msg(current_reference);

which instantiates a shared_ptr for the duration of this scope.

It's possible this approach is not the most C++20-ish way to do it, happy to address it other ways.

Associated Issues

n/a (did not open issue re segfault)

Testing

Completed joystick teleoperation for ~10+ minutes without segfault on current rolling image.

@amarburg amarburg marked this pull request as draft August 22, 2024 20:57
@amarburg
Copy link
Collaborator Author

amarburg commented Aug 23, 2024

The change has slowed the frequency, but didn't completely eliminate the segfaults. Here is a gdb backtrace (after ~30 minutes of running in sim) after recompiling velocity_controllers with CMAKE_BUILD_TYPE=RelWithDebInfo

(gdb) bt
#0  0x00007ffff7103963 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use_cold() () from /usr/lib/x86_64-linux-gnu/libspdlog.so.1.12
#1  0x00007fffe7f980e5 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>)
    at /usr/include/c++/13/bits/shared_ptr_base.h:1068
#2  std::__shared_ptr<geometry_msgs::msg::Twist_<std::allocator<void> >, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>,
    __in_chrg=<optimized out>)
    at /usr/include/c++/13/bits/shared_ptr_base.h:1524
#3  std::shared_ptr<geometry_msgs::msg::Twist_<std::allocator<void> > >::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>)
    at /usr/include/c++/13/bits/shared_ptr.h:175
#4  velocity_controllers::IntegralSlidingModeController::update_reference_from_subscribers (this=<optimized out>)
    at /home/ubuntu/ws_blue/src/auv_controllers/velocity_controllers/src/integral_sliding_mode_controller.cpp:276
#5  0x00007ffff7707e55 in controller_interface::ChainableControllerInterface::update(rclcpp::Time const&, rclcpp::Duration const&) ()
   from /opt/ros/rolling/lib/libcontroller_interface.so
#6  0x00007ffff7f11c6b in controller_manager::ControllerManager::update(rclcpp::Time const&, rclcpp::Duration const&) ()
   from /opt/ros/rolling/lib/libcontroller_manager.so
#7  0x0000555555558ae9 in ?? ()

Given RealtimeBuffer stores pointers internally, I wonder if using shared_ptrs as the datatype for the buffer is misuse of the API.

Another option is that there's memory corruption elsewhere.

@evan-palmer
Copy link
Contributor

Sorry it has taken me so long to get to this. I'm working to confirm that I can replicate the issue.

@evan-palmer
Copy link
Contributor

I did some testing using just auv_control_demos and didn't observe the issue there. I'm going to test it with the teleop demo next.

@evan-palmer
Copy link
Contributor

evan-palmer commented Sep 17, 2024

How often do you encounter this error @amarburg? I'm trying to replicate the problem, but haven't had success yet.

EDIT: Do you observe any memory leaks?

@amarburg
Copy link
Collaborator Author

I was able to replicate it consistently (consistently enough to spend some time debugging it). In general, it would happen fairly promptly (first 5-10 minutes of operation) but during some testing it could take hours before occuring.

@amarburg
Copy link
Collaborator Author

When I have a quiet moment I'll see if I can still replicate it with images from main

@amarburg
Copy link
Collaborator Author

amarburg commented Sep 18, 2024

When I have a quiet moment I'll see if I can still replicate it with images from main

I was able to replicate this morning with:

In window 1

  1. Clone blue
  2. cd blue/.docker/compose
  3. docker compose -f nvidia-desktop.yaml pull
  4. docker compose -f nvidia-desktop.yaml run blue

In window 2

  1. cd blue/.docker/compose && docker compose -f nvidia-desktop.yaml exec blue /bin/bash
  2. ros2 launch blue_demos bluerov2_demo.launch.yaml use_sim:=True

In window 3

  1. cd blue/.docker/compose && docker compose -f nvidia-desktop.yaml exec blue /bin/bash
  2. ros2 launch blue_demos bluerov2_controllers.launch.yaml use_sim:=True

In window 4

  1. cd blue/.docker/compose && docker compose -f nvidia-desktop.yaml exec blue /bin/bash
  2. ros2 launch blue_demos joy_teleop.launch.yaml

Then drive ROV around with gamepad.

Walk away for a while to do some other work.

Come back and ROV is no longer controllable.

Relatively silent error in windows 3 (bluerov_controllers.launch.yaml):

[ERROR] [ros2_control_node-2]: process has died [pid 2075, exit code -11, cmd 
'/opt/ros/rolling/lib/controller_manager/ros2_control_node --ros-args --params-file 
/home/ubuntu/ws_blue/install/blue_demos/share/blue_demos/control_integration/config/bluerov2_controllers.yaml -r 
/controller_manager/robot_description:=/robot_description'].'

@evan-palmer
Copy link
Contributor

I repeated the steps that you took above and let the simulator run for ~2.5 hours without any errors. It may be worth looking into writing some unit tests with the realtime_buffer to see if there are any test cases that can replicate/isolate the issue. What are the specs of the machine that you are running on?

@amarburg
Copy link
Collaborator Author

I repeated the steps that you took above and let the simulator run for ~2.5 hours without any errors. It may be worth looking into writing some unit tests with the realtime_buffer to see if there are any test cases that can replicate/isolate the issue. What are the specs of the machine that you are running on?

Over the course of testing, I was able to replicate this behavior on three different machines, an Intel laptop and Intel Desktop (i9-9980HK, 32GB, no GPU) and an AMD desktop (3950X, 64GB, 1650Super).

@amarburg
Copy link
Collaborator Author

Given I'm away from keyboard for at least a week, let's park this. Go ahead with #41, and I'll retest after that's gone through...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants