Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic Segmentation messages #71

Merged

Conversation

pepisg
Copy link
Contributor

@pepisg pepisg commented Jun 28, 2022

In this PR I aim to add a semantic segmentation message. I'm aware of the discussion going on on #63, however I think using images as masks may leave behind important metadata like the class_map (the name of the class corresponding to each ID). Also, using several channels in the image for sending the mask and the confidence of each pixel could lead to varying implementations that may not be compatibles with standard functionalities relying on semantic segmentation like the in progress semantic segmentation costmap plugin PR in nav2.

@mintar
Copy link
Contributor

mintar commented Jun 28, 2022

I agree that having additional metadata would be useful, but is that worth the disadvantages I've outlined in #63 (comment)?

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 29, 2022

@mintar totally agree. Looking at the proposal, I don't know how to massage all that content into a sensor_msgs/Image. Do you have a suggestion for that? For instance the confidence of the predictions and the threshold the algorithm is told to use about whether or not something is sufficiently confident for use. class_map also seems important to know what means what -- though I think that could be a potentially massive thing that might not be ideal to embed into the messages (maybe a camera_info -> semantic_info eq. is needed in addition the Image to contain that info)?

I suppose the confidence_threshold could be a ROS parameter for the end user without too much of an issue.

Between parameters and a semantic_info topic with the class maps, I think that handles quite a bit of the issue. The only thing that I'm not sure how to model with Image is the confidences.

Maybe we use a 2 channel image format, 1 for the class and a second for the confidence? Does the CV tools work for 2 channel images or would we need to add in a 3rd for visualization purposes. I suppose the 3rd channel could be threshold in case there are different thresholds for different classes / sections of the images, but I'm not sure if that's a 'normal' thing to do or I'm just coming up with ideas to try to fill space.


# the confidence of the inference of each pixel
# between 0-100%
uint8[] confidence
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the full range 0-255 here?

Copy link
Member

@SteveMacenski SteveMacenski Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not standard for the output of AI libraries, they usually give out a probability value. It would be a bit unnatural to convert it to 0-255 for use (e.g. I want to use predictions that are at least 80% confident). I'd argue this should actually be a float (?) from 0-1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a float, but does it really make a difference to have a confidence of say 60% (uint8) or 60.5% (float) knowing that this would increase the space used by this array by 4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specially on an array field that is expected to contain a lot of elements. On a 640x360 mask changing this for a float32 would mean an increase of 640x360x3 bytes, almost 700kB per message with respect to uint8

Copy link
Member

@SteveMacenski SteveMacenski Jun 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should make sure to explain these are uints not floats, otherwise people will just put their 0-1 percentages from TF or something into it and it'll always render as 0 since it needs to be multiplied by 100 and then cast to an int (in the comment above)


# Integer value corresponding to the value of pixels belonging
# to a given class in a segmentation mask
uint16 class_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a uint16 when the data field in SemanticSegmentation.msg is only uint8?

Limiting the number of classes to 255 is probably bad, but doubling the message size for the segmentation image by making it a uint16 is equally bad. Anyhow, the two should be consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, it's probably better to leave it as uint16. If we use a sensor_msgs/Image as the segmentation image format, we can choose between mono8 and mono16 encodings as appropriate there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't speak to that exactly without his comment, but there can easily be more than 255 class values in segmentation algorithms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, they must be consistent. I will use uint16 for the final version of the PR

@@ -0,0 +1,10 @@
# A key value pair that maps an integer class_id to a string class label.
# The class_id should be interpreted as the pixel value corresponding to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's trailing whitespace in this line which makes the tests fail. Also both files are missing a newline at the end of the file.

@mintar
Copy link
Contributor

mintar commented Jun 29, 2022

@SteveMacenski wrote:

@mintar totally agree. Looking at the proposal, I don't know how to massage all that content into a sensor_msgs/Image. Do you have a suggestion for that? For instance the confidence of the predictions and the threshold the algorithm is told to use about whether or not something is sufficiently confident for use. class_map also seems important to know what means what -- though I think that could be a potentially massive thing that might not be ideal to embed into the messages (maybe a camera_info -> semantic_info eq. is needed in addition the Image to contain that info)?

I don't think the class_map would be that massive. Let's be generous and assume 20 characters per class name, and let's assume we have 255 classes. Then the whole array would be 255 * (20 + 1) = 5355 bytes. A 640x480 uint8 image has 307200 bytes, so the class_map would be around 1.7% of that. On the other hand, 5kb overhead is not nothing, so I'd like to hear better ideas.

BTW, how to handle the mapping from int to class string is also an open issue with the existing messages (Detection3DArray etc.). I currently handle that by having a yaml file with the mapping that I load into the namespace of every single node that requires it. I would prefer it though if it was available as a message - that way if you record a rosbag, it would reflect the mapping at the time the rosbag was recorded.

I like the idea of publishing all the metadata on a "semantic_info" topic, similar to the camera_info topic. If we're really worried about the overhead of the class_map, perhaps we don't publish the semantic_info in sync with each frame, but instead only whenever it changes on a latched topic?

Maybe we use a 2 channel image format, 1 for the class and a second for the confidence? Does the CV tools work for 2 channel images or would we need to add in a 3rd for visualization purposes. I suppose the 3rd channel could be threshold in case there are different thresholds for different classes / sections of the images, but I'm not sure if that's a 'normal' thing to do or I'm just coming up with ideas to try to fill space.

Here's a list of all OpenCV image encodings: sensor_msgs/image_encodings.h

You can have 2-channel images, like 8UC2 (8 bit, unsigned, 2 channels) or 16UC2, and OpenCV handles them just fine. I'm not sure how well image_transport (for on-the-fly PNG or JPG compression) or visualization tools like RViz handle these more exotic formats. Or all the other stuff like image_proc, image_geometry and so on.

Alternatively, the 2 images (semantic segmentation and confidence) could be published on 2 different topics. That would have the following advantages:

  • You can use different encodings for the images. For example, mono8 or mono16 for the semantic segmentation image (depending whether you have more than 255 classes or not), and mono8 or 32FC1 or whatever for the confidence image.
  • You could also have one confidence image per class if that's important for your application
  • You'd have better support for image_transport and visualization tools (see above).

@SteveMacenski
Copy link
Member

5kb overhead is not nothing, so I'd like to hear better ideas.

Have a semantic_info topic / message like we have a camera_info topic / message for metadata regarding the camera. Now its metadata regarding the AI system. That way we can only grab this once (or pair up with the sensor_msgs/Image) and let us use the Image message without class definitions in it. That's not uncommon in my experience with camera_info to just grab 1 message and then stop subscribing to it.

would prefer it though if it was available as a message

Maybe then this is more generic than just a semantic_info topic but a vision_info topic? Latched seems fine to me!

the 2 images (semantic segmentation and confidence) could be published on 2 different topics.

That is interesting. I wonder about the run-time impacts of that since anything about 1-2MB in size has a hard time publishing without zero-copy or composition. I'm not sure if its better to have one 2MB message or two 1MB messages going into the same node. I suspect 1 message is better from some internal Samsung projects I'm helping with, but I don't have the data specifically to support that at the moment.

@pepisg
Copy link
Contributor Author

pepisg commented Jun 29, 2022

I agree with the idea of using sensor_msgs/msg/Image instead. From what you have been discussing, this is an sketch of how a semantic_info message would look like

vision_msgs/LabelMap[] label_map # uint16 - string key value pairs
float32 confidence_threshold

This message seems to also work for object detection models using the other messages on vision_msgs. int64 is used for ids there, but can there really be more than 65536 classes?
If using a single multichannel image for packing the mask and the confidence, I think it may be worth adding a field to somehow tell how the channels in the image are organized, this way it would not be necessary to define a fixed convention to use segmentation data with other ROS packages like nav2. This however would break the idea of having a message to fit AI applications

With that in mind, when things are settled I can reformat this PR to include these messags

@SteveMacenski
Copy link
Member

SteveMacenski commented Jun 30, 2022

VisionInfo maybe a good use for the message name if we plan to use this for detections and segmentations?

I think uint16 is a good bet for the moment and the current state of AI. If in another decade our magic AIs can know millions of things, we can always increase it.

@pepisg
Copy link
Contributor Author

pepisg commented Jul 1, 2022

Should we settle for two distinct images for the mask and confidence? Or should we use a single dual-channel image and define a convention to specify which channel should contain what? This could be overcame by adding a metadata field to the VisionInfo message to make explicit the channel order, or applications relying on semantic segmentation data (like the segmentation plugin for nav2) could take in some parameter in their configuration to let them know where to look.

The only drawback of using two images seems to be the potential impact of publishing 2 data intensive messages instead of one. That would also force the message consumer to synchronize them (which can be done with a message filter, but drifts from ros standard implementation). This approach seems to be more flexible and does not rely on assumed conventions nor requires extra configuration parameters on the consumers.

Let me know what you think to fix the PR and adjust the work on the segmentation plugin

@mintar
Copy link
Contributor

mintar commented Jul 1, 2022

I'd prefer two separate image topics instead of a two-channel image.

@SteveMacenski
Copy link
Member

I prefer a single 2-channel image over 2 separate images 😆

Maybe we need to ask @tfoote what his thoughts are. He usually brings up very good points when I propose interface definitions.

@mintar
Copy link
Contributor

mintar commented Jul 1, 2022

One thing that speaks for two separate images is that you can have separate data types. You wrote yourself that it's possible to have more than 255 classes. In that case, the segmentation image would need to be a uint16. If you have a single 2 channel image, that would mean the confidence image must also be uint16. With separate images, you could always use uint8 for the confidence image and uint8 or uint16 for the segmentation image, depending on the number of classes that you have.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jul 1, 2022

The option behind door number three is that we actually don't include confidence information at all, and make as part of the Interface API that any pixel below a certain threshold should be set to -1, meaning invalid, as part of populating the image. Therefore, when any consumer of this work is parsing the image, any value of -1 means it was too low of confidence for use, provided by the publisher node. Just push that logic down.

@mintar
Copy link
Contributor

mintar commented Jul 2, 2022

We can't do -1 with an unsigned int, but in semantic segmentation images "0" usually means "unknown class". The confidence image is still nice extra info to have.

@mintar
Copy link
Contributor

mintar commented Jul 2, 2022

Another advantage of two separate topics: Not all subscribers may be interested in the confidence values, so they don't even need to subscribe to that topic.

@SteveMacenski
Copy link
Member

We can't do -1 with an unsigned int, but in semantic segmentation images "0" usually means "unknown class". The confidence image is still nice extra info to have.

Well -1 is wrapped around in unsigned ints, so it would be max. I wouldn't want to override 0 since that could be a meaningful value for a detection/segmentation algorithm and it would be annoying if we had to have folks increment all their outputs by 1 to meet the standard the interface defines. Its far less disruptive / likely that a collision would occur at 65,536

I suppose we could do both; publish the confidence image and also set the precedence that 65,536 is a reserved value for not confident enough, if the publisher can know that value. That would get both worlds, but also require algorithms implementing this to act about both, if available.

@tfoote
Copy link

tfoote commented Jul 5, 2022

I prefer a single 2-channel image over 2 separate images laughing

Maybe we need to ask @tfoote what his thoughts are. He usually brings up very good points when I propose interface definitions.

The primary reason for choosing to send one multi channel image versus 2 separate images would be if they generally would be used usefully separately. The simplest examples are stereo images. If you want just a monocular camera stream, you don't want to incur the overhead of subscribing to the full stereo image just to ignore the second half of the stream. On the flip side there is a small amount of overhead related to synchronizing incoming images, and they have to have enough embedded content to be useful on their own (aka they both have full headers). We have tools to make it easier, but the extra overhead is why we don't send everything decoupled.

The only drawback of using two images seems to be the potential impact of publishing 2 data intensive messages instead of one. That would also force the message consumer to synchronize them (which can be done with a message filter, but drifts from ros standard implementation). This approach seems to be more flexible and does not rely on assumed conventions nor requires extra configuration parameters on the consumers.

I would disagree and say that publishing the synchronized image messages is the ROS standard. We do it for stereo images, compressed images, and camera_info.

And if you're publishing two images instead of one they should be half the size each as they will have half the data each (or some ratio depending on the datatypes)

The other major concern in this sort of discussion is encapsulation. You want to make sure that messages generally can stand on their own without knowing the context that they were broadcast in. Such that if you get a rosbag of some random data it will have everything that you need to process it later.

@mintar
Copy link
Contributor

mintar commented Jul 6, 2022

@SteveMacenski wrote:

Well -1 is wrapped around in unsigned ints, so it would be max. I wouldn't want to override 0 since that could be a meaningful value for a detection/segmentation algorithm and it would be annoying if we had to have folks increment all their outputs by 1 to meet the standard the interface defines. Its far less disruptive / likely that a collision would occur at 65,536

I've checked what other semantic segmentation / panoptic segmentation datasets are using, and most of them seem to use 255 for the "UNLABELED" class (bdd100k semantic, COCO-Stuff), although some use 0 (bdd100k panoptic, CityScapes). So I agree we should probably use 255 here. The only trouble is that "-1" means 255 if the semantic segmentation image encoding is mono8, and "-1" means 65,536 if the encoding is mono16. I would like to allow both encodings for the segmentation image.

@tfoote wrote:

The primary reason for choosing to send one multi channel image versus 2 separate images would be if they generally would be used usefully separately. [...] And if you're publishing two images instead of one they should be half the size each as they will have half the data each (or some ratio depending on the datatypes).

If we include the UNLABELED class (which we absolutely should), then the segmentation image is very useful on its own. I think this probably covers at least 80% of use cases; most clients probably won't care about the confidence image. So that would speak for having two separate image topics.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jul 6, 2022

I'd like to bring up though that we're now talking about having to synchronize 3 topics potentially (vision_info, segmentation_mask, segmentation_mask_confidences) for a "typical and full" implementation. That seems to be approaching excessiveness, even though I can live with the individual technical points that leads to that outcome.

I'm not sure I agree that they would both be useful separately. The segmentation mask without confidences could be interesting, but the confidences without the segmentation mask is not - unlike the stereo example where either image feed could be individually interesting without the other.

@mintar
Copy link
Contributor

mintar commented Jul 6, 2022

If we go with the latched topic for the vision messages, we only need to synchronize 2 topics, and only those clients that are interested in the confidences. You're right that the confidence image is not interesting on its own, but Tully's argument still holds: you save 50% data in a lot of use cases by having 2 topics.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jul 6, 2022

There would need to be a level of documentation required for that workflow -- e.g. it would not be obvious to anyone just looking at the vision_info message / topic that it would be a transient local QoS published 1 time at initialization and cannot be synchronized with the mask/detections. While camera_info you can do a same thing with, that's being constantly published with the images too. So this would be a slightly different design pattern.

I'd think to be analog to existing similar things, we'd want to publish with the masks/detections as well, regardless of QoS. Those that want to grab it 1 time to use forever can do so, but then others that want it synchronized for whatever reason can, just like camera_info. I think we have to assume that some people will want to do that, which would involve 3 synchronized topics.

@mintar
Copy link
Contributor

mintar commented Jul 6, 2022

Publishing vision_info in sync is fine with me!

@pepisg
Copy link
Contributor Author

pepisg commented Jul 8, 2022

To sum up, the consensus seems to be using 2 sensor_msgs/msg/Image along with a new vision_info message with the following structure:

vision_msgs/LabelMap[] label_map # uint16 - string key value pairs
float32 confidence_threshold

I will refactor this PR to only include the new vision_info and label_map messages.

Additionally according to @SteveMacenski suggestion some documentation should be added describing how:

  1. The -1 value in the segmentation mask, which would end up being the greatest unsigned value of the chosen encoding, means the pixel belongs to the UNLABELED class
  2. The workflow for getting vision_info messages when published using a transient local qos
  3. The workflow for synchronizing three topics containing a class mask, a confidence mask and the vision_info

Does that sound good @SteveMacenski @mintar ?

Also, please let me know what would be the best place to add the documentation

@mintar
Copy link
Contributor

mintar commented Jul 9, 2022

Sounds good! Comments on how to interpret the data fields (for example, the UNLABELED class) should go into the message comments of that field. More extensive documentation about the workflows should go into the README, IMO.

@SteveMacenski
Copy link
Member

Agree, or maybe add in a vision_msgs_example package that has some trivial example node subscribing & synchronizing the 3 just so there's some boilerplate folks can work with (or just a snippet in the readme is fine)

@pepisg
Copy link
Contributor Author

pepisg commented Jul 13, 2022

Hi @mintar, I just realized there is already a VisionInfo message defined here. Should I update it with what was discussed here or should I create another one called say InferenceInfo? I will do the latter since I imagine changing an interface can break people's code easily, but I will change back in case you prefer to do so

@mintar
Copy link
Contributor

mintar commented Jul 14, 2022

Huh, I forgot about the existing VisionInfo. I agree that we can't change it (adding fields is usually ok, but not completely remodeling the message), so we should have a new message. I'm not fond of the name InferenceInfo, but I can't think of a better name. @SteveMacenski , @Kukanani: Any ideas?

README.md Outdated
@@ -47,6 +47,14 @@ http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D)
in your code, as the message's header should match the header of the source
data.

Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows to use all the ROS tools for image processing easily and to choose the most lightweight encoding for each type of message. To transmit the metadata associated with the vision pipeline the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message can be used analogously to how `/sensor_msgs/CameraInfo` message and `/sensor_msgs/Image` are:

1. The `InferenceInfo` topic can be latched so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2 this can be achieved using a `transient local` QoS profile.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be inference_info when talking about the topic (as opposed to the message type InferenceInfo).

README.md Outdated

2. The subscribing node can get and store one `InferenceInfo` message and cancel its subscription after that. This assumes the provider of the message publishes it periodically.

3. The `InferenceInfo` message can be synchronized with the `Image` messages that containing the segmentation mask and the confidence values of the inference of each pixel. This can be achieved using ROS's `message_filters`. The same [snippet](http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D) mentioned above can work as an example.
Copy link
Contributor

@mintar mintar Jul 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this incompatible with mentioning the latched topic above? If we use the time synchronizers from message_filters, we need to publish a InferenceInfo message for every image with the exact same time stamp (just like the camera_info topic). The section about latched publishers above sounds like we only publish a message once (or whenever the contents change), analogous to static TFs or maps.

I'm fine with both, BTW.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. InferenceInfo should be one of these two:

  1. A message truly like CameraInfo - it is not expected to change. Caching it locally on other nodes should be fine. Latched publishers should be fine.
  2. It can't be assumed to be the same for each published Image. If that's the case, we need to remove all mention of latching and caching, and require that people subscribe to both the Image and the InferenceInfo. But if they're always being published in pairs, you'll need to remind me why we aren't combining them into one message.

Copy link
Contributor Author

@pepisg pepisg Jul 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be both ways, that's why I want to place both options in the readme to describe what would be the common ways of using the message. The third option I propose also assumes the message does not change but does not require the use of a latched topic. I think that it is worth referencing at least the first two but let me know if you think we should settle for only one of them

# constant) over time, and so it is wasteful to send it with each individual
# result. By listening to these messages, subscribers will receive
# the context in which published vision messages are to be interpreted.
# Each vision pipeline should publish its VisionInfo messages to its own topic,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VisionInfo -> InferenceInfo

@Kukanani
Copy link
Collaborator

I like the interface that we've settled on here. In terms of naming, if InferenceInfo is specifically for use with semantic segmentation, should we call it something like SegmentationInfo?

Either way, I'm going to suggest some documentation improvements for clarity.

README.md Outdated
@@ -47,6 +47,14 @@ http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D)
in your code, as the message's header should match the header of the source
data.

Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows to use all the ROS tools for image processing easily and to choose the most lightweight encoding for each type of message. To transmit the metadata associated with the vision pipeline the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message can be used analogously to how `/sensor_msgs/CameraInfo` message and `/sensor_msgs/Image` are:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows to use all the ROS tools for image processing easily and to choose the most lightweight encoding for each type of message. To transmit the metadata associated with the vision pipeline the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message can be used analogously to how `/sensor_msgs/CameraInfo` message and `/sensor_msgs/Image` are:
Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows systems to use standard ROS tools for image processing, and allows choosing the most compact image encoding appropriate for the task. To transmit the metadata associated with the vision pipeline, you should use the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message. This message works the same as `/sensor_msgs/CameraInfo or [`/vision_msgs/VisionInfo`](msg/VisionInfo.msg):

README.md Outdated
@@ -47,6 +47,14 @@ http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D)
in your code, as the message's header should match the header of the source
data.

Semantic segmentation pipelines should use `sensor_msgs/Image` messages for publishing segmentation and confidence masks. This allows to use all the ROS tools for image processing easily and to choose the most lightweight encoding for each type of message. To transmit the metadata associated with the vision pipeline the [`/vision_msgs/InferenceInfo`](msg/InferenceInfo.msg) message can be used analogously to how `/sensor_msgs/CameraInfo` message and `/sensor_msgs/Image` are:

1. The `InferenceInfo` topic can be latched so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2 this can be achieved using a `transient local` QoS profile.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. The `InferenceInfo` topic can be latched so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2 this can be achieved using a `transient local` QoS profile.
1. Publish `InferenceInfo` to a topic. The topic should be at same namespace level as the associated image. That is, if your image is published at `/my_segmentation_node/image`, the `InferenceInfo` should be published at `/my_segmentation_node/inference_info`. Use a latched publisher for `InferenceInfo`, so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2, this can be achieved using a `transient local` QoS profile.

README.md Outdated

1. The `InferenceInfo` topic can be latched so that new nodes joining the ROS system can get the messages that were published since the beginning. In ROS2 this can be achieved using a `transient local` QoS profile.

2. The subscribing node can get and store one `InferenceInfo` message and cancel its subscription after that. This assumes the provider of the message publishes it periodically.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. The subscribing node can get and store one `InferenceInfo` message and cancel its subscription after that. This assumes the provider of the message publishes it periodically.
2. A subscribing node may receive an `InferenceInfo` message, store it locally, then cancel its subscription.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need to assume that the publisher publishes periodically - that's the point of making it latched

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its my understanding that a latched topic uses the transient local QoS. In contrast, this second method does not require the usage of any specific QoS, that's why messages should be published periodically to allow new nodes to get the messages.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

README.md Outdated

2. The subscribing node can get and store one `InferenceInfo` message and cancel its subscription after that. This assumes the provider of the message publishes it periodically.

3. The `InferenceInfo` message can be synchronized with the `Image` messages that containing the segmentation mask and the confidence values of the inference of each pixel. This can be achieved using ROS's `message_filters`. The same [snippet](http://wiki.ros.org/message_filters#Policy-Based_Synchronizer_.5BROS_1.1.2B-.5D) mentioned above can work as an example.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. InferenceInfo should be one of these two:

  1. A message truly like CameraInfo - it is not expected to change. Caching it locally on other nodes should be fine. Latched publishers should be fine.
  2. It can't be assumed to be the same for each published Image. If that's the case, we need to remove all mention of latching and caching, and require that people subscribe to both the Image and the InferenceInfo. But if they're always being published in pairs, you'll need to remind me why we aren't combining them into one message.

msg/InferenceInfo.msg Outdated Show resolved Hide resolved
@pepisg
Copy link
Contributor Author

pepisg commented Jul 15, 2022

I like the interface that we've settled on here. In terms of naming, if InferenceInfo is specifically for use with semantic segmentation, should we call it something like SegmentationInfo?

This message is not only meant for segmentation but for object detection as well, or in general to any vision pipeline that returns id based classifications that you later may want to convert back to human readable string class names.

@SteveMacenski
Copy link
Member

SteveMacenski commented Jul 15, 2022

LabelInfo? ClassInfo? Then the individual instances would be Label or Class, plus or minus a Vision.

I agree, I'm not a fan of InferenceInfo

@jmm-slamcore
Copy link
Contributor

@pepisg is this still relevant? Think it would be great to get this merged to bring more obj. det./semantics capabilities to ROS2/Nav2

@SteveMacenski
Copy link
Member

agreed!

@Kukanani
Copy link
Collaborator

Kukanani commented Nov 16, 2022

List of outstanding items I see before this can merge:

  • Alternate name for InferenceInfo. I propose SegmentationInfo but am also happy with ClassInfo or LabelInfo or perhaps some other proposal. Author's choice :)
  • Clarify all comments and docs to either be latched/transient local Qos (and no mention of "being like CameraInfo" or time synchronizers), or continually published. I don't have a preference, but we should be consistent in our intended usage, or else it will cause confusion
  • Other misc comment typos/clarifications as proposed in PR comments

Anything else?

@pepisg
Copy link
Contributor Author

pepisg commented Nov 16, 2022

Alternate name for InferenceInfo. I propose SegmentationInfo but am also happy with ClassInfo or LabelInfo or perhaps some other proposal. Author's choice :)

Done. left it LabelInfo following @SteveMacenski 's suggestion and my earlier thought about this messages enabling users to convert id encoded data from vision pipelines to human readable class names. which may apply for object detection as well

Clarify all comments and docs to either be latched/transient local Qos (and no mention of "being like CameraInfo" or time synchronizers), or continually published. I don't have a preference, but we should be consistent in our intended usage, or else it will cause confusion

We seemed to agree on being two possible setups: a) latched topic; b) node subscribes to topic, gets one message and then unsubscribes. There are several comments on the PR which I tried to address on the last commit regarding this. Please let me know if you have any further comments on this section of the documentation.

Other misc comment typos/clarifications as proposed in PR comments

I guess I integrated all your change requests, let me know if there is anything else I should change / include

@SteveMacenski
Copy link
Member

@Kukanani are you happy with these?

Copy link
Collaborator

@Kukanani Kukanani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, ship it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants