
Support --repeat-latched in ROS 2 (transient local) #1159

Open · amacneil opened this issue Nov 11, 2022 · 23 comments
Labels: enhancement (New feature or request)

@amacneil (Contributor) commented Nov 11, 2022

Description

ROS 1 supported a --repeat-latched feature (ros/ros_comm#1850) to repeat latched topics at the start of each bag split.

As far as I can tell, rosbag2 does not have this feature. Is there any reason we should not add it?

amacneil added the enhancement (New feature or request) label on Nov 11, 2022
@MichaelOrlov (Contributor)

@amacneil Yes, we have a reason not to do this: to keep replay deterministic.
This is a duplicate of #1051; please see my detailed explanation in #1051 (review).

@amacneil (Contributor, Author)

I'll admit I am not super familiar with how tf_static and transient local works in ROS 2 (I am more familiar with latched topics in ROS 1). ros2/ros2#464 seems to have some details.

We have users who would like to open one file from a split bag and have the transforms work correctly. Your comment in that issue seems to be based on the fact that you don't have a way in rosbag2 to "play" a split bag file, but we do support opening a single file in Foxglove. I think it would be good to get more feedback from users on whether they need this feature.

@MichaelOrlov (Contributor)

If users want something, it doesn't mean we should jump in and implement it without considering other cases and drawbacks.

My main point is not that rosbag2 doesn't support playing a split bag, but rather that we need determinism in replay. Basically, we should not inject or copy messages that we never received on the wire, in any case.

There are many other cases where transient local durability is used, for instance for camera images or LIDAR streams. And what then? Should we copy all messages from a camera topic from the previous file to the next file when doing a split?

If someone wants to repeat the tf_static messages that were published once at the beginning before playing back a bag recorded after a split, they can use rosbag2 play with the duration option and topic filtering to play back just the tf_static messages, and then play back the bag recorded after the split normally.
Another option is to use ros2 bag convert to create a new bag containing the tf_static messages from the first bag plus the rest of the content from the bag recorded after the split.

@jhurliman

> There are many other cases where transient local durability is used, for instance for camera images or LIDAR streams. And what then? Should we copy all messages from a camera topic from the previous file to the next file when doing a split?

My understanding of transient local durability is that there is a user-configured history size, so it would only copy all messages from the previous file for a topic if the previous file was very short or the history size was unusually large. The default history depth is 10, and 5 for the sensor data QoS profile.
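For reference, a minimal rclcpp sketch of the publisher side of this: transient local durability combined with an explicit KEEP_LAST depth. The topic name, message type, and depth below are purely illustrative.

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("latched_publisher");

  // Transient local durability with a KEEP_LAST history of 10: the publisher
  // caches its last 10 messages and re-delivers them to late-joining
  // subscribers that also request transient local durability.
  rclcpp::QoS qos(rclcpp::KeepLast(10));
  qos.transient_local();

  auto pub = node->create_publisher<std_msgs::msg::String>("chatter", qos);
  std_msgs::msg::String msg;
  msg.data = "latched once at startup";
  pub->publish(msg);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}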

@MichaelOrlov (Contributor)

@jhurliman The way it is organized in DDS and ROS 2 is that the publisher keeps a cache of N messages when transient local durability is used, and when a new subscriber joins, it receives the last N messages from the publisher's cache rather than just the single latest message.

The case where transient local durability is used for the tf_static topic and the publisher only publishes a few messages at the beginning is just one particular case.

There can be other cases where transient local is used for camera image or LIDAR topics with a continuous data stream. A newly created subscriber will get the last N images from the cache and then everything that follows from the ongoing stream, since the publisher keeps publishing new messages.
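To illustrate the late-joiner behavior described above, a minimal subscriber sketch (same illustrative topic as in the previous snippet). Note that the subscription must also request transient local durability; a volatile subscription still matches but receives no cached history.

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("late_joiner");

  // Request transient local durability so the matched publisher re-sends
  // up to its history depth (N) of cached messages to this new subscriber.
  rclcpp::QoS qos(rclcpp::KeepLast(10));
  qos.transient_local();

  auto sub = node->create_subscription<std_msgs::msg::String>(
    "chatter", qos,
    [](const std_msgs::msg::String & msg) {
      // The first callbacks deliver the publisher's cached messages,
      // then live messages follow as they are published.
      RCLCPP_INFO(rclcpp::get_logger("late_joiner"), "got: %s", msg.data.c_str());
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}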

@jhurliman @amacneil After some consideration, I tend to agree that we can implement this feature, but not as the default behavior, only as an option enabled via a CLI parameter.

Some technical details and considerations on how to implement this feature:

  • Unfortunately, we can't re-read messages from the publisher cache without creating a new subscription. We would have to go back to the topics with transient local QoS settings that were already written, read out all of their messages, and repeat them in the new bag after the split. Or we could limit the number of messages to repeat via a CLI parameter, for instance --repeat-latched 10, which would repeat the first 10 messages. Or maybe the last 10 messages? The last 10 messages seems more consistent, but is probably more difficult to implement.
  • Please also consider the performance penalty of doing this read-out and saving into the new file during the split.
  • The alternative of creating new subscriptions for the same topics to read out the publisher cache would take even more time and resources due to discovery delays.
  • The split operation should take a very small amount of time and resources, since we lock the cache from dumping messages to storage during this operation, and the probability of dropping messages increases the longer the split takes.
  • If I am not mistaken, the current implementation of the rosbag2_storage_mcap plugin doesn't really support a read-write interface; it's either read or write mode. Please consider this as well.

Thoughts?
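For the first bullet above, a rough sketch of the "go back and read out the already-written topics" idea using the public rosbag2_cpp reader API. The previous split's URI, the list of transient-local topics, and the message cap are placeholders for values the recorder would already have; this is not how the recorder is wired up today.

#include <memory>
#include <string>
#include <vector>

#include <rosbag2_cpp/reader.hpp>
#include <rosbag2_storage/serialized_bag_message.hpp>
#include <rosbag2_storage/storage_filter.hpp>

// Re-read messages for the given transient-local topics from an already
// closed split file so they can be repeated at the start of the next file.
std::vector<std::shared_ptr<rosbag2_storage::SerializedBagMessage>>
read_latched_messages(
  const std::string & previous_split_uri,
  const std::vector<std::string> & latched_topics,
  size_t max_total_messages)
{
  rosbag2_cpp::Reader reader;
  reader.open(previous_split_uri);

  // Only iterate over the latched (transient local) topics.
  rosbag2_storage::StorageFilter filter;
  filter.topics = latched_topics;
  reader.set_filter(filter);

  std::vector<std::shared_ptr<rosbag2_storage::SerializedBagMessage>> out;
  while (reader.has_next() && out.size() < max_total_messages) {
    out.push_back(reader.read_next());
  }
  return out;
}

This read-back would have to happen while the split is in progress, which is exactly where the performance concern in the other bullets comes from.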

emersonknapp changed the title from "Support --repeat-latched in ROS 2" to "Support --repeat-latched in ROS 2 (transient local)" on Jan 21, 2023
@ga58lar commented May 30, 2023

@amacneil @MichaelOrlov Hi everyone, are there any updates on this feature?
This feature is exactly what my colleagues and I are currently looking for to record tf_static messages on each split and thus make it possible to play a single bag of a split with all information.

@MichaelOrlov (Contributor)

@ga58lar It is unlikely we will implement this feature, since it contradicts the DDS communication approach and would unavoidably introduce a significant performance impact on the split operation and cause message drops during recording.

I would suggest a workaround.
If you need to play back only the messages from the tf_static topic at the beginning of some bag file, or at some arbitrary time, you can use the ros2 bag convert command (https://github.com/ros2/rosbag2#converting-bags) to extract the tf_static messages into a separate bag file, and then use a separate ros2 bag play instance to play back those messages whenever you need them.
You can use a launch YAML file to define the order of the playback calls.

@emersonknapp (Collaborator) commented Jun 7, 2023

The way I understand this, there are these different cases being considered here:

  1. During playback (ros2 bag play) from the beginning - we want latched topics to persist between bag splits so that late joiners receive the messages, even post split.
  2. When starting play from a later timestamp that happens to be in a later split file - we still want those publishers to have the messages they theoretically would have
  3. When analyzing bags as data objects (and perhaps separating splits from each other), we want to see messages in the bag representing the current latched/transient_local message buffers

I believe that Case 1 actually already works correctly. When you start playback from the beginning, publishers are created with the appropriate QoS -- so if the original topic in the recorded system had durability=TRANSIENT_LOCAL, then the rosbag2 playback publisher will do the same. It will then have the same latched behavior for new joiners, even across splits, because playback publishers are not shut down and recreated across splits.

Case 2, on the other hand, would still exhibit that behavior, but only for messages originally published after the timestamp where playback starts. To give the publishers the missing internal state, we could check for latched publishers and do a "burst publish" of their topics from the beginning of the bag up to the starting point, which would approximate that behavior.

Case 3 would prove more difficult, though I don't see it as an engineering impossibility. We might create a separate entry type, an "internal latch-buffer" topic at the beginning of new splits, that is not played back during normal operation but is only used for the "burst publish latch priming" when starting playback from that file.

Unfortunately, the History Depth (latching count) of the publisher is not always broadcast in a way that would let rosbag2 know what it is. DDS considers History Depth (as well as Lifespan, but that's less relevant here) to be an internal QoS configuration; other participants don't need to know about it, so it's not guaranteed in every implementation to be conveyed in discovery.

As @MichaelOrlov brought up, though, the potential extra features I mentioned are not without performance cost.

@tonynajjar commented Apr 3, 2024

Hey, I just encountered this issue when trying to visualize split bag files in Foxglove. @amacneil or @MichaelOrlov, I understand that this is WIP, but did you find a workaround I could use at the moment, either on the Foxglove or the rosbag2 side? The only options for me are to publish /tf_static more frequently, which defeats its purpose, or to move my static TFs to /tf and publish at a high rate, which is a waste of resources. Thanks

Maybe @ga58lar you have an idea as well?

P.S: @MichaelOrlov your proposed workaround would not work in my use case unfortunately

@ga58lar commented Apr 3, 2024

@tonynajjar Unfortunately, I have not found a really good solution that avoids the criticism raised above.
I am now publishing the static transforms on every bag-split event of the recording. I think the ease of use for split bags is worth the overhead.
In my case, all static transforms are located within one URDF file and I am using the robot_state_publisher.
Therefore, I just added one subscription to events/write_split and call publishFixedTransforms() in the callback.

https://github.com/ros/robot_state_publisher/blob/ros2/src/robot_state_publisher.cpp#L151

  // Requires #include <rosbag2_interfaces/msg/write_split_event.hpp> and a
  // rclcpp::Subscription<rosbag2_interfaces::msg::WriteSplitEvent>::SharedPtr
  // bag_split_sub_ member in the class.
  // subscribe to bag split events
  bag_split_sub_ = this->create_subscription<rosbag2_interfaces::msg::WriteSplitEvent>(
    "events/write_split",
    rclcpp::SensorDataQoS(),
    std::bind(&RobotStatePublisher::callbackBagSplit, this, std::placeholders::_1),
    subscriber_options);

https://github.com/ros/robot_state_publisher/blob/ros2/src/robot_state_publisher.cpp#L364

void RobotStatePublisher::callbackBagSplit([[maybe_unused]] rosbag2_interfaces::msg::WriteSplitEvent::SharedPtr message)
{
  // re-publish tf_static from urdf file on every bag split
  publishFixedTransforms();
  return;
}

It would be better, though, to subscribe to /tf_static as well and re-publish that topic, instead of loading the static transforms on each split.

@tonynajjar commented Apr 3, 2024

Thanks for sharing @ga58lar, it does look like an okay workaround when all your static transforms come from the robot_state_publisher; unfortunately, that's not the case for me, they come from at least 3 different sources. I find it quite dirty to enable regular republishing of /tf_static from all the sources :/

@tonynajjar commented Apr 10, 2024

For anyone wanting to implement this in their own fork in the future: I implemented it on Humble: angsa-robotics#1 (inspired by #1051).

I don't like maintaining a long-term fork but I ran out of options. @MichaelOrlov I understand why you don't want this to be the default behavior but I think having the option to enable it is not so bad

@Aposhian (Contributor)

As @tonynajjar mentioned, this feature is very useful for analyzing single split files, either with static analysis (e.g. Foxglove) or even with playback. By embedding transient local topics in each split, playing back each split appears as if the recorder were just a late subscriber at the time the split started. So while, yes, this feature is not exactly correct, it is incredibly useful, and I think it makes sense to have it as an opt-in feature.

@MichaelOrlov (Contributor)

@Aposhian @tonynajjar OK, it looks like this paints a picture for a feature design proposal.

  1. Add a new --repeat-latched CLI option to the recorder.
  2. The --repeat-latched CLI option will accept a queue depth, i.e. the number of messages to memorize on each topic. The default value is 0, meaning the feature is disabled.
  3. A uint32_t repeat_latched field shall be added to the rosbag2_transport::RecordOptions data structure. A default value of 0 will mean that the --repeat-latched CLI option was not selected.
  4. The repeat-latched feature will memorize in RAM only the first N messages from all topics with QoS durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL.
  5. The repeat-latched feature will save the previously memorized messages at the beginning of each new bag file after the split.
  6. The repeat-latched feature will rewrite the sent and received timestamps of the memorized messages to be current before dumping them into the bag file, i.e. the same timestamps as the latest received message.
  7. The same rosbag2_transport::RecordOptions.repeat_latched parameter shall be supported by the ros2 bag rewrite CLI.
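To make the proposal above a bit more concrete, here is a rough sketch of the record option and a per-topic in-memory cache (items 3-6). Only repeat_latched matches the proposal; every other name is hypothetical, and the timestamp fields follow the newer SerializedBagMessage layout (older distros have a single time_stamp field).

#include <cstdint>
#include <deque>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

#include <rosbag2_storage/serialized_bag_message.hpp>

// Item 3: the new RecordOptions field; 0 keeps the feature disabled.
struct RecordOptionsSketch
{
  uint32_t repeat_latched = 0;
};

// Items 4-6: memorize the first N messages per transient-local topic and
// hand back timestamp-rewritten copies to write at the start of each split.
class RepeatLatchedCache
{
public:
  explicit RepeatLatchedCache(uint32_t depth) : depth_(depth) {}

  // Called for every recorded message on a transient-local topic.
  void on_message(
    const std::string & topic,
    std::shared_ptr<const rosbag2_storage::SerializedBagMessage> msg)
  {
    auto & queue = cache_[topic];
    if (depth_ != 0 && queue.size() < depth_) {
      queue.push_back(std::move(msg));
    }
  }

  // Called right after a split: copies of the memorized messages with their
  // timestamps rewritten to "now" so they land at the start of the new file.
  std::vector<rosbag2_storage::SerializedBagMessage>
  messages_for_new_split(int64_t now_ns) const
  {
    std::vector<rosbag2_storage::SerializedBagMessage> out;
    for (const auto & entry : cache_) {
      for (const auto & msg : entry.second) {
        rosbag2_storage::SerializedBagMessage copy = *msg;
        copy.recv_timestamp = now_ns;
        copy.send_timestamp = now_ns;
        out.push_back(copy);
      }
    }
    return out;
  }

private:
  uint32_t depth_;
  std::unordered_map<
    std::string,
    std::deque<std::shared_ptr<const rosbag2_storage::SerializedBagMessage>>> cache_;
};

At split time the recorder would write the result of messages_for_new_split() into the new file before resuming normal writing.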

@Aposhian (Contributor) commented Aug 8, 2024

@MichaelOrlov do you have any issues with @tonynajjar's implementation? https://github.com/angsa-robotics/rosbag2/pull/1/files

@tonynajjar

@Aposhian I suppose the objection is the same as mentioned here since it's the same implementation

@MichaelOrlov (Contributor)

@tonynajjar @Aposhian We just need to rewrite it to adhere to the design proposal in #1159 (comment).

@Rayman (Contributor) commented Jan 24, 2025

One issue I have with this proposal is that the queue depth could be different per topic. As an example:

  • /map is a large data structure; you want only the last one.
  • For /tf_static you want the last published TFMessage per publisher on the topic. Suppose there are five nodes that each publish a TFMessage; then a late joiner wants to receive 5 TFMessages.
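One hypothetical way to express that kind of per-topic control on top of the earlier proposal; none of these fields exist in rosbag2, this only illustrates the point:

#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical: a global default depth (--repeat-latched N) plus per-topic
// overrides, e.g. 1 for /map and a larger value for /tf_static so that one
// TFMessage per static-transform publisher can be repeated.
struct RepeatLatchedOptionsSketch
{
  uint32_t default_depth = 0;
  std::unordered_map<std::string, uint32_t> per_topic_depth;
};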

@MichaelOrlov (Contributor) commented Jan 24, 2025

@emersonknapp You mentioned in today's tooling WG meeting that there is a way to know the publisher queue size, but that not all DDS implementations support it.
Could you please point out how to get the publisher queue size on the subscriber side? I would like to consider this option.

@emersonknapp (Collaborator) commented Jan 24, 2025

@MichaelOrlov the History Depth is the number of messages that get persisted in the queue and re-sent to new participants when using Transient Local Durability. In at least one of the dds implementations, this History Depth is filled correctly when we discover the reported qos in get endpoint info by topic https://github.com/ros2/rosbag2/blob/rolling/rosbag2_transport%2Fsrc%2Frosbag2_transport%2Frecorder.cpp#L697

In the other one, the History Depth number is always filled with the default value (I think that's 10). I can't remember right now which one of cyclone vs fastdds does which behavior.

If I remember correctly, the dds spec doesn't dictate this behavior because History Depth does not affect matching compatibility, so it's not required to be reported. I could be wrong about that, in which case this is a bug in the one implementation.

@MichaelOrlov (Contributor)

I traced it down: all 3 DDS RMW layers (CycloneDDS, FastDDS, and ConnextDDS) use the graph cache, common_context->graph_cache.get_writers_info_by_topic(..), to get that info.
Now the question is how that information gets into the graph cache from the different DDS implementations.

@MichaelOrlov (Contributor) commented Jan 28, 2025

Okay. I've researched the topic of the publisher history depth and how it gets into the graph_cache.
Essentially, the QoS gets into the graph_cache via the dds_qos_to_rmw_qos(const DDSQoSPolicyT & dds_qos, rmw_qos_profile_t * qos) conversion function, which is specialized for each DDS RMW implementation.
The only limitation I found is that the history depth is invalid when the history kind is KEEP_ALL; in CycloneDDS, in particular, it is set to -1. However, the ROS 2 CycloneDDS RMW layer maps it to 0 in the case of the KEEP_ALL history kind.
As regards FastDDS, the history depth is preserved in any case. However, it is expected to hold an invalid value in the case of the KEEP_ALL history kind.

In all other cases, the history depth is expected to have a valid value! If not, it should be considered a bug.

For the --repeat-latched feature, I would propose limiting the queue depth to the minimum of publisher->qos.depth(), which we can get from the node->get_publishers_info_by_topic(topic_name) call, and the N repeat-latched messages specified in the CLI option.
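A small sketch of that min() on the recorder side, assuming we only trust the reported depth when the publisher's history kind is KEEP_LAST (the helper name is made up):

#include <algorithm>
#include <cstddef>
#include <string>

#include <rclcpp/rclcpp.hpp>

// Effective number of messages to repeat for one topic: the minimum of the
// CLI value N and the depth reported by matched publishers. Falls back to N
// when the reported depth is unusable (e.g. KEEP_ALL history).
size_t effective_repeat_depth(
  const rclcpp::Node::SharedPtr & node,
  const std::string & topic_name,
  size_t cli_repeat_latched)
{
  size_t depth = cli_repeat_latched;
  for (const auto & info : node->get_publishers_info_by_topic(topic_name)) {
    const rmw_qos_profile_t profile = info.qos_profile().get_rmw_qos_profile();
    if (profile.history == RMW_QOS_POLICY_HISTORY_KEEP_LAST && profile.depth > 0) {
      depth = std::min(depth, profile.depth);
    }
  }
  return depth;
}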

As a follow-up, we will need to write a simple unit test with a publisher and a subscriber node and a call to node->get_publishers_info_by_topic(topic_name) to see in which cases and with which DDS implementations it doesn't work as expected.
Then, if faulty behavior is found and it is not with the KEEP_ALL history kind, we need to create a bug report and a new issue for the relevant DDS implementation or RMW layer.

@MichaelOrlov (Contributor)

cc: @mjcarroll This feature, --repeat-latched N for the Rosbag2 recorder, could be a good candidate for the summer code exercise.
We more or less agreed on the design of this feature. See details in the #1159 (comment) and my latest message above #1159 (comment).

However, the problem is that almost everybody wants this feature ASAP.
I think if nobody takes it and completes it before summer, it could be a candidate.
