
Add initial skeleton for LTTng tracing in navigation2 #2788

Closed

Conversation

vmayoral

Basic Info

This contribution aims to bring the first few building blocks of LTTng instrumentation to the ROS 2 navigation stack. It aligns with ros-perception/image_pipeline#717, which added instrumentation to a perception stack package, and also with the REP-2008 PR benchmarking architecture.

Particularly, in a similar fashion to what was discussed here for image_pipeline, this PR forks tracetools and creates a new ROS 2 package named nav2_tracetools, which is intended to include only the metadata and TRACEPOINTS for navigation2. This should increase the granularity of the LTTng instrumentation, making it easier to maintain (instead of having to send PRs to tracetools and iterate in longer cycles).

The current PR adds the building blocks, but no package has been instrumented just yet, to simplify the review and testing of these initial changes. First candidates for instrumentation are costmaps and controllers, as indicated in past WG meetings. Such instrumentation can come in follow-up PRs with analysis examples of the probes and resulting LTTng traces, as well as documentation extensions on how to benchmark such components with practical examples.

Ticket(s) this addresses: New feature, similar to ros-perception/image_pipeline#717
Primary OS tested on: Ubuntu 20.04
Robotic platform tested on: ROS 2 Rolling (so None, really)

Description of contribution in a few bullet points

  • Creates nav2_tracetools package within the navigation2 meta-package
  • Adds basic building blocks for LTTng instrumentation
  • Proposes a new tracepoint provider named ros2_navigation2 (i.e. all traces related to nav2 could be filtered by using this tracepoint provider). Similarly, launch files can easily enable/disable traces of nav2 using this tracepoint provider.

Description of documentation updates required from your changes

No direct documentation changes required as part of this PR. Once first navigation2 components get instrumented, it'd be nice to have that aligned with documentation tutorials that show how to make use of it.


Future work that may be required in bullet points

To simplify review of the changes, this PR is limited to the initial skeleton of the changes required for LTTng integration into nav2. The following work should be delivered in upcoming PRs:

  • Instrument navigation2 components. Initial targets:
    • costmaps
    • controllers
  • Documentation
    • Create an introductory section on benchmarking in the documentation that explains a rationale similar to what's proposed in the REP-2008 PR.
  • Document how to instrument a navigation2 component
  • Document how to trace and analyze the results of an instrumented LTTng component

For Maintainers:

  • Check that any new parameters added are updated in navigation.ros.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins are added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

@mergify
Contributor

mergify bot commented Jan 27, 2022

@vmayoral, please properly fill in PR template in the future. @SteveMacenski, use this instead.

  • Check that any new parameters added are updated in navigation.ros.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins are added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

@vmayoral
Author

@vmayoral, please properly fill in PR template in the future. @SteveMacenski, use this instead.

I left it as requested in the template. Just curious, why did mergify complain?

@SteveMacenski
Member

SteveMacenski commented Jan 27, 2022

It's being really flaky, sorry, you're fine. It's like whack-a-mole at times with the regex for detecting if it's changed. I test a change, it works, then a week later it comes back... I should just remove it, but I feel like if I keep seeing it I'll eventually get around to fixing it up again.

@SteveMacenski
Member

Does this have any impact on licensing and/or the licenses which this package can be made available under? I see LTTng is a GPL-variant license, which makes me immediately suspicious of an integration here where everything is BSD or Apache (except AMCL, but for historical reasons we can't do much about that).

I'm not clear as to why there's a fork or what specific changes were made relative to that fork. Can you shed some light here? In other words: if tracetools is general, why do we need a Nav2 version of it? Other than the file names and namespaces, I don't see anything "nav2-y" here.

The last question is more abstract - why does this need to be in the Nav2 stack proper? I wonder if tooling for performance characterization / bringing issues to light should be in an auxiliary repository within ros-planning (e.g. nav2_performance or nav2_tracing or something). We have a few auxiliary repos for different project subsets that don't make sense to be in the main codebase itself, like our website, tutorials, and in-the-works advanced features.

@vmayoral
Author

vmayoral commented Jan 28, 2022

Does this have any impact on licensing and/or the licenses which this package can be made available under? I see LTTng is a GPL-variant license, which makes me immediately suspicious of an integration here where everything is BSD or Apache (except AMCL, but for historical reasons we can't do much about that).

The contributions are licensed as Apache 2.0, in alignment with ros2_tracing. I don't think there's any conflict with that, but I'll defer to @christophebedard and @iluetkeb, who may know more about it as the original authors of the LTTng integration.

I'm not clear as to why there's a fork or what specific changes were made relative to that fork. Can you shed some light here? In other words: if tracetools is general, why do we need a Nav2 version of it?

That's argued here and here. I'll paste the most important parts:

The rationale behind this comes from the experience of having worked for a few months with the ROS 2 tracing infrastructure. To me, it eventually became rather cumbersome to have to maintain a single tracetools repo with all tracepoints (I experienced this directly with an extension of tracetools_acceleration we keep internally). Altogether this led me to realize that it might be better to simply add a tracetools_<whatever-ros2-stack/package>.

From @christophebedard:

This looks good to me. For reference, a similar rationale wrt the ROS 2 core itself was brought up a few months ago here: ros2/rclcpp#1738 (comment). If these were tracepoints for the ROS 2 core (i.e., currently rclcpp and below), I'd prefer to avoid forking tracetools/putting the tracepoints in other packages to avoid the usual fragmentation of open source code, but this isn't really the case. It's more of an extension, i.e., additional instrumentation for application-level layers above the core, so I think this is the way to go for these use-cases. Besides, there is nothing about LTTng that requires instrumentation (or, rather, tracepoints) to be centralized like it is currently for the ROS 2 core, and the interactions with the Trace action are not really affected by this choice.

Other than the file names and namespaces, I don't see anything "nav2-y" here.

There's no instrumentation introduced for specific nav2 packages just yet, only the core of the LTTng integration. Specific instrumentation will need to be added in follow-up PRs based on what we decide to research. We could add that here if needed. My first approach has been incremental, given the novelty of the topic for nav2 (and also because reviewing becomes easier that way). See here for an example instrumenting image_proc's rectify. We could take the same route here for costmaps and controllers if that's preferred.

The last question is more abstract - why does this need to be in the Nav2 stack proper? I wonder if tooling for performance characterization / bringing to light issues should be in an auxiliary repository within ros-planning (ex nav2_performance or nav2_tracing or something). We have a few auxiliary repos for different project subsets that don't make sense to be in the main codebase itself like our website, tutorials, and in-the-works advanced features

I think there's merit in this proposal (and that was my first thought as well while producing tracetools_acceleration, which aims to be generic for all hardware acceleration repos). The proposed code changes here boil down to a new ROS 2 package called nav2_tracetools. One could argue that this same package could be turned into planning_tracetools, with all subprojects within the ros-planning org depending on it.

This could be done if there's consensus at the ros-planning level, but note that 1) nav2 (and all other ros-planning subprojects) will still need a build dependency on such a package (e.g. planning_tracetools), and 2) nav2 code will still need to be instrumented here. Given 2), my thinking is that the currently proposed approach will add value to nav2 faster in the short term, but I'm OK with bringing this up at the ros-planning level if there's enough support behind the idea.

@iluetkeb

iluetkeb commented Jan 28, 2022

Does this have any impact on licensing and/or the licenses which this package can be made available under?

lttng-ust, which is what ros2_tracing links against if enabled, is LGPL-licensed, and all its headers are MIT-licensed; see
https://github.com/lttng/lttng-ust/blob/master/LICENSE

The LGPL is well understood and entirely compatible to link with Apache-2.0 and BSD code. Since all headers are MIT-licensed, there is also no problem with including them.

btw, you're also asking what the potential benefit is. For that, it may interest you that the old latency/phase-shift analysis I did for the ROS 1 nav stack was based on custom tracepoints. With that, I could show that, due to a lack of synchronization between the map updater and the planners, use of new perception data in planning had extremely large jitter (up to 200 ms in our experiments). Again in our experiments, this reduced the accuracy of obstacle avoidance by 5 cm (meaning we had to include an additional 5 cm tolerance left and right of the robot, on top of the tolerance necessary due to gridmap resolution, sensor accuracy, etc., to prevent collisions). And this was for a slow-moving robot (0.5 m/s). Just imagine what it would do to a faster robot. This issue had been in the ROS 1 navstack for ages (in fact, it still is), and it took tracing to find it.

And regarding performance: the Linux kernel is full of tracepoints. They use somewhat different technology, but fundamentally the same approach. When you turn them all on, you definitely notice it, particularly for heavy things like tracing the scheduler, which can produce millions of events per second. But you only do that when you're measuring, and all other measurement methods have an even worse impact. That's why everybody profiles using tracing these days. When turned off, the impact of tracepoints is almost zero.

@SteveMacenski
Member

SteveMacenski commented Feb 2, 2022

Altogether this led me to realize that it might be better to simply add a tracetools_<whatever-ros2-stack/package>.

OK, but what does that look like here? Right now, I don't see anything Nav2-specific, and I'm not sure what adding tracepoints actually looks like in the context of this package. I'm looking to get an idea of what I'm getting myself into here. Looking at your image pipeline example makes me think that this is incredibly intrusive to the source code, which I'm not a fan of unless these are broadly removable and only inserted when being actively used. Adding a bunch of these tracepoints in the source code would make some files much less readable, obscuring the flow of logic actually occurring. But I could see a situation where we have none in by default, with the ability for users to add them easily. Hence the ask: "what does this look like".

Another good question is - how much maintenance does something like this need, and who's doing that maintenance? I suppose, if these are more specific to packages, would it be sensible to instead have tracepoints/ directories within the packages themselves so they manage their own?

Note that I have not used this tool before; valgrind / GDB have served me well so far. So a bit of this is me getting up to speed. I'm not opposed to additional options, but I want the value of any software added to this project to be clear and immediately useful to some stakeholder. I don't add things "just because" based on a perceived need. I want at least one specific stakeholder I can point to that is going to use this work in their system or impact change in the project to ultimately merge it.

It has been mentioned a few times that tracepoints add "essentially" no overhead. Do you have some metrics on what that means? Adds 0.1%, 1%, 10%, etc.? Is this something that can be compiled in only when used with a flag, versus always being in the background? That would also make it easier for a centralized planning_tracepoint package outside of this repo to be used (since it's not a build requirement, only a requirement when setting a flag to true).

Do you have a "tracing for idiots" tutorial link that I can potentially use to learn about this in a more abstract form before thinking about all of the ROS-y and Nav2-y specifics? I think step 1 here is just for me to know "what is this thing, why is it better than existing tools, and how can I use it" (cough and write a tutorial about it for both my own memory and for others). No matter what, I think we'll need a pretty detailed tutorial for the website to go along with this package so that users can know how to leverage it. That would be analogous for the image pipeline (or any other packages you've added it to), so it can be a central location for a "working with tracepoints in ROS 2"-style tutorial. Which would be nice to have 😄

Then to @iluetkeb's comment

For that it may interest you that the old latency/phase-shift analysis I did for the ROS1 nav stack was based on custom tracepoints. With that, I could show that due to a lack of synchronization between the map updater and the planners, use of new perception data in planning had an extremely large jitter (up to 200ms in our experiments). Again in our experiments this reduced accuracy of obstacle avoidance by 5cm (meaning, we had to include an additional 5cm tolerance left and right of the robot, on top of the tolerance necessary due to gridmap resolution and sensor accuracy, etc, to prevent collisions). And this was a for a slow-moving robot (0.5m/s). Just imagine what it would do to a faster robot. This issue had been in the ros1 navstack for ages (in fact, it still is), and it took tracing to find it.

Do you have a link to that? That sounds familiar, but jogging my memory now in context with details would help this discussion.

I would like to use that as a baseline to measure this metric on Nav2 as well to see what's up. Fundamentally, the costmap and planner relationship hasn't really changed so I suspect that is still the case, so it would be good to see that and understand what other types of things tracing might be valuable to expose in the project.

Edit: Another good question is how this would impact Windows / macOS users

@vmayoral
Author

vmayoral commented Feb 2, 2022

OK, but what does that look like here? Right now, I don't see anything Nav2 specific and I'm not sure what adding tracepoints actually looks like in the context of this package. I'm looking to get an idea of what I'm getting myself into here. Looking at your image pipeline example makes me think that this is incredibly intrusive to the source code which I'm not a fan of unless these are removable broadly and only inserted when being actively used. Adding a bunch of these tracepoints in the source code would make some files much less readable with the flow of logic actually occurring. But I could see a situation where we have none in by default, but the ability to add them easily for users. Hence, the ask "what does this look like".

@SteveMacenski this was indicated above, see https://github.com/ros-perception/image_pipeline/pull/717/files#diff-5e106d08b6f4807955c147444021dc9e2a50dc414e0a5555428cf7c59f51ea74R81-R93 for a similar example but targeting the image_pipeline package of the perception stack.

Again also, kindly note

The current PR adds the building blocks but no package has been instrumented just yet to simplify the review and testing of these initial changes. First candidates for instrumentation are costmaps and controllers as indicated in past WG meetings. Such instrumentation can come in follow up PRs with analysis examples of the probes and resulting LTTng traces, as well as documentation extensions on how to benchmark such components with practical examples.

If needed, those can be added here. We can pick a target and add a few probes. I just split the work, adding the skeleton first and additional probes over the skeleton later, for a) a first pass/assessment on your side (which is what's happening here) and b) simplifying the review process.

Another good question is - how much maintenance does something like this need, and who's doing that maintenance? I suppose, if these are more specific to packages, would it be sensible to instead have tracepoints/ directories within the packages themselves so they manage their own?

Very little maintenance in my experience. Once the probes are introduced into the source code, they shouldn't need to be modified (unless we move to a newer LTTng version that introduces API changes, which I believe is not very likely). I'm happy to help with the maintenance with the spare cycles I have.

I suppose, if these are more specific to packages, would it be sensible to instead have tracepoints/ directories within the packages themselves so they manage their own?

That's the way it works. We need a ROS 2 package that provides the tracepoint definitions (that's the proposed nav2_tracetools) and then, each one of these tracepoints is used in its corresponding package.
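Concretely, that definition/usage split might look like the following hypothetical layout (package and file names are illustrative, loosely mirroring how ros2_tracing structures tracetools; nothing here is prescribed by this PR):

```
navigation2/
├── nav2_tracetools/                # provider package: tracepoint definitions
│   ├── include/nav2_tracetools/
│   │   └── tracetools.h            # TRACEPOINT(...) macros for ros2_navigation2
│   ├── src/tracetools.c            # lttng-ust provider implementation
│   └── CMakeLists.txt              # would expose a TRACETOOLS_DISABLED-style option
├── nav2_costmap_2d/                # consumer: calls TRACEPOINT(...) in its sources
└── nav2_controller/                # consumer: likewise, in follow-up PRs
```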

It has been mentioned a few times that tracepoints add "essentially" no overhead. Do you have some metrics on what that means? Adds 0.1%, 1%, 10%, etc?

See @christophebedard and @iluetkeb's https://arxiv.org/pdf/2201.00393.pdf; it's a good read for answers on the overhead. Tracing is disabled by default unless you have LTTng installed on your workstation.

See https://gitlab.com/ros-tracing/ros2_tracing#tracing for more details on its operation. I personally prefer triggering tracing sessions from launch files (which, frankly, is pretty cool). See https://github.com/ros-acceleration/acceleration_examples/blob/main/nodes/faster_doublevadd_publisher/launch/vadd_capture.launch.py#L36-L50 for an example.

Is this something that can be compiled in only when used with a flag, versus always being in the background? That would also make it easier for a centralized planning_tracepoint package outside of this repo to be used (since it's not a build requirement, only a requirement when setting a flag to true).

Yes, but I'd discourage this for a development tree. My experience so far indicates that you want your tracing close to your core source code (that's why this PR is here and not against https://gitlab.com/ros-tracing/ros2_tracing).

Note that most of the ROS 2 packages these days (official distros) ship with tracing. The impact is minimal (read more about it in the instrumentation design). For production environments, you can disable tracing at build-time as follows:

colcon build --cmake-args " -DTRACETOOLS_DISABLED=ON"

Note that LTTng introduces huge advantages when compared to other benchmarking/tracing alternatives that operate in userspace. LTTng does it from kernel space.

Regarding documentation comments, +1, that was proposed above as well:

Future work that may be required in bullet points

To simplify review of the changes, this PR is limited to the initial skeleton of the changes required for LTTng integration into nav2. The following work should be delivered in upcoming PRs:

  • Instrument navigation2 components. Initial targets:
    • costmaps
    • controllers
  • Documentation
    • Create an introductory section on benchmarking in the documentation that explains a rationale similar to what's proposed in the REP-2008 PR.
  • Document how to instrument a navigation2 component
  • Document how to trace and analyze the results of an instrumented LTTng component

@SteveMacenski
Member

SteveMacenski commented Feb 8, 2022

Sorry, it's taken me some time to think this over and try to come up with some concise and clear thoughts.

Is including this all that is needed before being able to include and use tracepoints in the source code itself?

If so, what I'm thinking now is the following

  • Merge this either in this repo or in another in ros-planning that can be used when required
  • Add a tutorial about how to add tracepoints and work with them to complete some practical task in Nav2 / ROS 2
  • Have a compile flag with Nav2 that defaults to off that folks can flip in order to compile with tracing
  • or compile it with tracepoints by default but include no tracepoints in the stack itself by default

This gives folks the ability to use it if they want, without impacting the source code if they're not actively working through an issue. It provides the documentation required to set it up so they can work with it, similar to the backtrace tutorial, so that we don't have to randomly guess which areas are most useful for users to have tracepoints added to when figuring out various issues. I don't want to be dealing with a flood of PRs to add tracepoints to every nook and cranny of the codebase on a constant basis, based on the specific debug needs of a specific user. If they're working out a problem, they should add targeted tracepoints to work through that issue.

It's not as much about run-time impact (though I would want to understand that) as it is about the maintenance load it could add and decreased code readability for contributors. I can get used to anything (heck, 5 years ago I wasn't even writing tests, and now I write tests and documentation like it's nothin'), but making sure that things are clean, fluid, and readable is an important part of keeping the barrier to entry low, to get community users to contribute back and get involved in the project.

I could, however, be convinced to add tracepoints into the stack itself if the following is met, which doesn't have to happen immediately

  • There exists already a tutorial about what tracepoints are, how to work with them in ROS 2 / Nav2, and an applied example that is sufficiently low-barrier-to-entry that new users can be expected to work with them (at least theoretically), similar to the backtrace tutorial
  • There's a user of Nav2 that is requesting these to be added by default and would derive value from that - to have a direct 'customer' for this work. I don't mind at all if you advertise / promote this to the community to find a customer, just as long as there is one.
  • There's a clear policy we can reasonably reliably apply for where these get added, so that there are criteria about where and how they are added, not just willy-nilly. For example: first thing in all non-debug subscriber or action callbacks, on the major API interfaces, etc. Something that restricts the scope to the most useful data and is principled enough to guide maintenance decisions. Certainly it can be expanded or refined over time.

Basically, I want a user, a process, and the documentation for a typical user to be able to leverage it so the feature would reap benefits for the userbase. Does that sound fair? That would resolve my concerns described above.

@vmayoral
Author

vmayoral commented Feb 8, 2022

Basically, I want a user, a process, and the documentation for a typical user to be able to leverage it so the feature would reap benefits for the userbase. Does that sound fair? That would resolve my concerns described above.

This sounds fair to me but it may take some time to get it all together.

There exists already a tutorial about what tracepoints are, how to work with them in ROS 2 / Nav2, and an applied example that is sufficiently low-barrier to entry it can be expected that new users can work with them (at least theoretically), similar to the traceback tutorial

This is the first thing I believe we should tackle to gain acceptance of tracing within the community. Beyond Nav2, I think this is of interest for everyone in the community, so I'd make it ROS 2 generic. @christophebedard, would you like to partner and put something like this together for ROS 2? I know there's one really good resource1 for ROS that you created in the past. Would you be open to working together on this and putting time into renewing it for ROS 2? What's the best location for it? The ros2_tracing repo?

I particularly like the idea of the analysis @christophebedard did around a simple "pipeline" of nodes. This intersects a bit with some of my "day work" at Xilinx, so I can probably put quite some time into it soon-ish.

Footnotes

  1. https://christophebedard.com/ros-tracing-message-flow/

@christophebedard

christophebedard commented Feb 8, 2022

Basically, I want a user, a process, and the documentation for a typical user to be able to leverage it so the feature would reap benefits for the userbase. Does that sound fair? That would resolve my concerns described above.

This sounds fair to me but it may take some time to get it all together.

I agree with all of the above.

@christophebedard, would you like to partner and put something like this together for ROS 2?

I agree that we need some good documentation, with explanations and tutorial(s) for any and all contributors, to keep the barrier of entry as low as possible.

We do have some documentation/tutorials, like this tutorial in the RTWG documentation, but the scope is fairly limited. I think it's best to re-use & link to existing resources whenever possible. For tracing in general, it could be this one. Simply adding that link to the ros2_tracing RTWG tutorial would probably be a good start.

This is similar to what we want to do for real-time-related documentation (ros-realtime/community#32).

I particularly like the idea of the analysis @christophebedard did around a simple "pipeline" of nodes. This intersects a bit with some of my "day work" at Xilinx so I can probably put quite time on it soon-ish.

I'm working on an equivalent & (big) improvement for ROS 2 😜 but unfortunately that means I can't really commit to spending much time on this documentation effort in the near future. At least not until May.

However, I created an issue over on the ros2_tracing repo to define the scope of this effort: https://gitlab.com/ros-tracing/ros2_tracing/-/issues/138. Having input from people less familiar with tracing would be very helpful.

@SteveMacenski
Member

SteveMacenski commented Feb 8, 2022

I think a tutorial for the Nav2 or ROS 2 docs should look something like

  • What is tracing in simple terms - what kinds of info do you get out of it and why do we care
  • How are tracepoints defined
  • How are they used/setup in ROS 2 packages (e.g. things like nav2_tracing)
  • Example from an ecosystem package like Nav2 or TF or something adding tracepoints to the project to benchmark something more than a toy example but less than a complex full-system analysis
  • Example of running and looking over the results, how to read the analysis and pull out the pertinent information

Essentially tracing 101 like https://navigation.ros.org/tutorials/docs/get_backtrace.html

@iluetkeb

iluetkeb commented Feb 9, 2022

If you guys can come up with a use-case (a particular kind of analysis) that would be particularly interesting to the Nav2 community, Bosch CR might be able to help implement and document it.

It won't be short-term, more like in the next 6-9 months, but judging from the discussion you could live with that time-frame, right?

@SteveMacenski
Member

SteveMacenski commented Feb 9, 2022

That sounds fantastic. Absolutely no rush; most of the work in Nav2, too, is usually scheduled around that kind of time block, considering most of the contributors are volunteers working in their spare time. Any analysis on any timetable you're able to manage is A-OK with me.

I think a few things would be interesting to understand

  • Similar to your experiment on costmap updates vs. planners and controllers: using that information to understand what issues we might have there and potential solutions - particularly solution ideas
  • Understanding, with the distributed action server design of Nav2, any jitter or other oddities it's injecting into the system, both in a composed single process and as multiple processes (potentially on other CPUs on a LAN)
  • With respect to behavior tree execution, modeling and understanding better how data is flowing and whether we have any bottlenecks or problematic nodes hogging tick durations

Certainly some analysis of the algorithms is always welcome to help us improve them, but I'd say that would take a backseat to the higher-level architectural elements above. While I'm sure there are places we can improve performance in the algorithms, that's not where the jitter / blocking / etc. that such analysis would bring to light derives from.

@SteveMacenski
Member

SteveMacenski commented Mar 24, 2022

Happy to reopen, but I don't like stale things sitting around while there are action items. I think this discussion, when it's ready to be continued, should be in a separate ticket, and then we can reopen / open a new PR as appropriate from the outcome of those discussions. There are still a few documentation / explanation / role questions to be answered that are outside of this PR / repo, so this is blocked for the immediate future.
