tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan> dropping all messages if tf latency is greater than scan message rate #516
Comments
How fast is your odometry being published? 20hz is exceptionally low. Usually this is between 50-100hz, but often higher. Even if you had a queue of built-up messages to process, I don't think that guarantees this resolves your issue. The queue size mostly helps with not dropping relatively recent data when processing takes too long. The transform timeout for the filter is set by default to
Agreed. The queue size should remain 1.
The odom rate is relatively high. I experience this issue with gazebo when stressing out my dev system, which is resulting in latency in the transforms. When this latency is larger than 50ms, scan messages can't be processed in slam_toolbox. I'll see if I can make a minimal bag file that triggers this behavior. The fix is pretty simple and makes the node more resilient to non-optimal inputs.
I don't understand that comment; why do you believe that's true? I do not believe message filters will preempt the search for a message frame's transform just because a new message came in. It should use the transform timeout duration and pass the message along from there; other messages would simply be dropped if a new one came in while it was checking for a transform. I would try to find a solution within the use of message filters, which is going to stay. If there's some bug in message filters, then we can resolve that over in that repo.
Because that is what I'm experiencing. I'll try to create a minimal bag file.
That would be ideal.
I'm heading out on PTO in 4 days for about a month, so don't make a bag file on my account; I'm not going to be able to open it until I'm back at the earliest. I'd look at the TF message filter code directly and see if you can find the logic for handling messages that you think you're experiencing. That's not been my understanding of how it works, so if not, that might point to another potential issue.
Well, I went ahead and created some bag files anyway, since it will help with debugging the message filter regardless. The bag files contain exactly the same messages,
Why exactly is your TF delayed so much? I know that's not really the point here, but that seems illustrative of a problem. Global localization being delayed or irregular is baked into the system, but localized odometry is super important not only for SLAM Toolbox, but for pretty much every consumer of transformation information in the system. Message filters are everywhere in ROS, so if you have a problem here, you're going to run into it all over the place. Uncharacteristically, I don't have a solution in mind for this. I suppose you could add a delay to your laser scans, e.g.
Thanks for your thoughts on this. I agree with your points about low level odometry.
You're already blocking for TF, with the intent of blocking up to 500ms by default, though that is way overkill. In practice, on a well-tuned system, as you say, there will be little if any blocking on TF, because it will be timely. So having the large default delay doesn't hurt anything in the normal case, but it does make slam_toolbox more robust to sub-optimal inputs. Unfortunately, the tf message filter doesn't seem to work as expected in this case, with queue size 1. The message filters are great, but here they're only saving a single, very common tf check at the top of the scan callbacks. The explicit check will result in the intended behavior and provide more useful diagnostic information than what is currently output.
I don't have control of the low-level system or its constraints. Undoubtedly, things are sub-optimal, but my understanding of the intended callback design in slam_toolbox, based on your comment, is that it should wait up to 500ms for the transform:
The bag files I uploaded should demonstrate it's not working as intended and that the messages are likely getting preempted as new messages push them off the queue. I'll investigate whether there is an easy fix in the tf message filter, but it's possible that this is the result of its fundamental design vs. a simple bug. In the meantime, the MR I made resolves the issue for me, so it's not something I'm searching for a solution for. I do think it would probably be beneficial to other users, which is why I made the issue and MR.
Does a queue size of 2 make any such difference if you think there's an edge case not properly handled? It wouldn't be a big deal either way actually to keep a queue size of 2 if it functionally resolves a problem. |
So, it does look like it's an issue of the tf message filter preempting messages that are in queue while they are waiting for transforms. I made a branch which prevents the message filter from preempting the oldest message that is waiting for a transform and it seems to work as I would expect: ros2/geometry2@galactic...malban:geometry2:prevent-message-preemption-galactic Would you mind taking a look at that? If you think it looks reasonable I'll make a MR on geometry2
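To illustrate the idea behind that branch (a purely conceptual sketch, not the actual geometry2 patch): when the filter queue is full, keep the message that is already waiting on its transform and drop the newer arrival, instead of evicting the oldest entry.

```cpp
#include <cstddef>
#include <deque>

// Purely conceptual sketch -- NOT the actual geometry2 patch. It only
// contrasts the two eviction policies for a full filter queue.
template<typename MessageT>
struct PendingQueue
{
  std::deque<MessageT> pending;  // front() is the oldest message, still waiting on TF
  std::size_t max_size = 1;

  void add(const MessageT & msg)
  {
    if (pending.size() < max_size) {
      pending.push_back(msg);
      return;
    }
    // Preempting policy (the behavior described in this issue):
    //   pending.pop_front();      // evict the scan that was waiting for its transform
    //   pending.push_back(msg);   // with max_size == 1, no scan is ever processed
    //
    // Non-preempting policy (the idea behind the linked branch):
    //   keep the waiting message and drop the newer arrival instead.
  }
};
```

With a queue size of 1 the two policies differ exactly in the failure mode described here: the preempting policy starves the waiting scan whenever the transform wait exceeds the scan period.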
Queue size 2 doesn't seem to work for the bag data I uploaded, but 3 does. I'm not sure why 2 doesn't work; I'll need to step through the message filter code to see why.
I suspect that's just a function of the specific delay in your system, not a general rule. I'll take a look!
Yeah, that patch looks reasonable! Tag me in the PR and I'll make sure to keep track of it |
You're probably right, it's just a little weird that none of the messages appear to be dropped in the
I'm seeing something very similar. I haven't determined the root cause. In my case, setting transform_publish_period >= 0.05 seems to fix the problem. It may not be the same issue, but I'm posting this here in case someone else has the same problem and comes looking for a solution.
Required Info:
Steps to reproduce issue
odom_frame_ to scan frame tf transform with a latency of 50+ ms
Expected behavior
scan messages are processed by slam_toolbox one at a time with a queue size of one, dropping messages as necessary
Actual behavior
all scan messages are dropped
Additional information
It seems like this is an artifact of using the tf message filter and having a queue size of 1. The message filter is supposed to wait until the transform is ready before activating the callback, but before that happens a new message arrives and pushes the waiting one out of the queue. Since the queue size is 1, the callback can never be triggered if the transform wait time is greater than the period of the messages.
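For reference, the scan pipeline runs through the filter roughly as in the sketch below. This is a minimal sketch, not slam_toolbox's exact code: the node name, topic, odom frame name, and 500ms timeout are illustrative assumptions; only the queue size of 1 comes from the discussion above.

```cpp
#include <chrono>
#include <functional>
#include <memory>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "message_filters/subscriber.h"
#include "tf2_ros/buffer.h"
#include "tf2_ros/create_timer_ros.h"
#include "tf2_ros/message_filter.h"
#include "tf2_ros/transform_listener.h"

// Illustrative wiring of a LaserScan subscription through a tf2 MessageFilter.
// The filter buffers up to queue_size scans while waiting for the
// odom -> scan-frame transform and gives up on a scan after tf_timeout.
class ScanFilterExample : public rclcpp::Node
{
public:
  ScanFilterExample()
  : Node("scan_filter_example"),
    tf_buffer_(this->get_clock()),
    tf_listener_(tf_buffer_)
  {
    // The buffer needs a timer interface so the filter can wait asynchronously.
    tf_buffer_.setCreateTimerInterface(
      std::make_shared<tf2_ros::CreateTimerROS>(
        this->get_node_base_interface(), this->get_node_timers_interface()));

    const std::string odom_frame = "odom";                   // assumed frame name
    const uint32_t queue_size = 1;                           // the value under discussion
    const auto tf_timeout = std::chrono::milliseconds(500);  // filter's transform timeout

    scan_sub_.subscribe(this, "scan", rmw_qos_profile_sensor_data);
    scan_filter_ = std::make_unique<tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan>>(
      scan_sub_, tf_buffer_, odom_frame, queue_size,
      this->get_node_logging_interface(), this->get_node_clock_interface(), tf_timeout);
    scan_filter_->registerCallback(
      std::bind(&ScanFilterExample::laserCallback, this, std::placeholders::_1));
  }

private:
  void laserCallback(sensor_msgs::msg::LaserScan::ConstSharedPtr scan)
  {
    // Only reached once the odom -> scan frame transform at scan->header.stamp
    // is available; otherwise the scan is dropped by the filter.
    RCLCPP_INFO(
      this->get_logger(), "Processing scan stamped %d.%09u",
      scan->header.stamp.sec, scan->header.stamp.nanosec);
  }

  tf2_ros::Buffer tf_buffer_;
  tf2_ros::TransformListener tf_listener_;
  message_filters::Subscriber<sensor_msgs::msg::LaserScan> scan_sub_;
  std::unique_ptr<tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan>> scan_filter_;
};
```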
Increasing the queue size is problematic, because then a queue of messages can build up, which is not desired for async mode.
What seems to work fine is to use a regular subscriber with a queue size of 1, and put an explicit canTransform check at the head of the callbacks. The tf timeout can then be set to a user-configurable value.
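A minimal sketch of that alternative, assuming a plain rclcpp subscription; the node name, topic, frame name, and 0.5s timeout are illustrative, not slam_toolbox's actual parameters:

```cpp
#include <memory>
#include <string>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "tf2_ros/buffer.h"
#include "tf2_ros/transform_listener.h"

// Plain subscription with an explicit transform check at the head of the
// callback, instead of a tf2 MessageFilter.
class ExplicitTfCheckExample : public rclcpp::Node
{
public:
  ExplicitTfCheckExample()
  : Node("explicit_tf_check_example"),
    tf_buffer_(this->get_clock()),
    tf_listener_(tf_buffer_)
  {
    // The listener spins in its own thread, so blocking in this node's
    // callback does not stop TF data from arriving.
    tf_buffer_.setUsingDedicatedThread(true);

    scan_sub_ = this->create_subscription<sensor_msgs::msg::LaserScan>(
      "scan", rclcpp::SensorDataQoS().keep_last(1),
      [this](sensor_msgs::msg::LaserScan::ConstSharedPtr scan) {laserCallback(scan);});
  }

private:
  void laserCallback(sensor_msgs::msg::LaserScan::ConstSharedPtr scan)
  {
    const std::string odom_frame = "odom";                        // assumed frame name
    const auto tf_timeout = rclcpp::Duration::from_seconds(0.5);  // user-configurable in the proposal

    // Explicit check: wait up to tf_timeout for the transform, then either
    // process the scan or drop it with a useful diagnostic.
    std::string error;
    if (!tf_buffer_.canTransform(
        odom_frame, scan->header.frame_id, rclcpp::Time(scan->header.stamp),
        tf_timeout, &error))
    {
      RCLCPP_WARN(
        this->get_logger(), "Dropping scan: no transform %s -> %s after timeout (%s)",
        odom_frame.c_str(), scan->header.frame_id.c_str(), error.c_str());
      return;
    }
    // ... process the scan ...
  }

  tf2_ros::Buffer tf_buffer_;
  tf2_ros::TransformListener tf_listener_;
  rclcpp::Subscription<sensor_msgs::msg::LaserScan>::SharedPtr scan_sub_;
};
```

With this structure the wait happens inside the node's own callback, so a late transform delays processing of that scan, but newer scans can no longer evict it before it is handled.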