
rcl test test_publisher_init_fini does not fini #585

Closed

rotu opened this issue Feb 29, 2020 · 12 comments

rotu (Contributor) commented Feb 29, 2020

The tests in test_publisher_init_fini create a publisher (rcl_publisher_init) without finalizing it (rcl_publisher_fini). This causes a failed assertion in CycloneDDS in debug mode when tearing down the node (rcl_node_fini), and possibly a memory leak. It's clear from the name of the test that the test was intended to have fini calls, but it's not clear whether or not RCL and/or CycloneDDS should issue a warning/error in this case.

// Check that valid publisher is valid
publisher = rcl_get_zero_initialized_publisher();
ret = rcl_publisher_init(&publisher, this->node_ptr, ts, topic_name, &default_publisher_options);
EXPECT_EQ(RCL_RET_OK, ret) << rcl_get_error_string().str;
EXPECT_TRUE(rcl_publisher_is_valid(&publisher));
rcl_reset_error();

#3  0x00007f60aba81412 in __GI___assert_fail (assertion=0x7f60ab818d1a "ddsrt_hh_iter_first (gv->sertopics, &it) == NULL", file=0x7f60ab818642 "../../src/eclipse-cyclonedds/cyclonedds/src/core/ddsi/src/q_init.c", line=1780, function=0x7f60ab818dd1 "void rtps_fini(struct ddsi_domaingv *)") at assert.c:101
#4  0x00007f60ab7a306b in rtps_fini (gv=0x10ad010) at ../../src/eclipse-cyclonedds/cyclonedds/src/core/ddsi/src/q_init.c:1780
#5  0x00007f60ab7dfa82 in dds_domain_free (vdomain=0x10acd70) at ../../src/eclipse-cyclonedds/cyclonedds/src/core/ddsc/src/dds_domain.c:269
#6  0x00007f60ab7e440a in dds_entity_deriver_delete (e=0x10acd70) at ../../src/eclipse-cyclonedds/cyclonedds/src/core/ddsc/src/dds__types.h:189
#7  0x00007f60ab7e5159 in really_delete_pinned_closed_locked (e=0x10acd70, delstate=DIS_EXPLICIT) at ../../src/eclipse-cyclonedds/cyclonedds/src/core/ddsc/src/dds_entity.c:503
#8  0x00007f60ab7e4e07 in dds_delete_impl_pinned (e=0x10acd70, delstate=DIS_EXPLICIT) at ../../src/eclipse-cyclonedds/cyclonedds/src/core/ddsc/src/dds_entity.c:400
#9  0x00007f60ab7e4651 in dds_delete_impl (entity=983535177, delstate=DIS_EXPLICIT) at ../../src/eclipse-cyclonedds/cyclonedds/src/core/ddsc/src/dds_entity.c:375
#10 0x00007f60ab7e4d25 in dds_delete (entity=983535177) at ../../src/eclipse-cyclonedds/cyclonedds/src/core/ddsc/src/dds_entity.c:358
#11 0x00007f60b2cb1cda in node_gone_from_domain_locked (did=42) at ../../src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:568
#12 0x00007f60b2cb235f in rmw_destroy_node (node=0x12cd9b0) at ../../src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:833
#13 0x00007f60b33c0240 in rcl_node_fini (node=0x10b2200) at ../../src/ros2/rcl/rcl/src/rcl/node.c:457
#14 0x00000000004135f3 in TestPublisherFixture__rmw_cyclonedds_cpp::TearDown (this=0x12cf4c0) at ../../src/ros2/rcl/rcl/test/rcl/test_publisher.cpp:66
#15 0x00000000004562e4 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x12cf4c0, method=&virtual testing::Test::TearDown(), location=0x462c3a "TearDown()") at ../../install/src/gtest_vendor/./src/gtest.cc:2447
rotu (Author) commented Feb 29, 2020

FYI: @eboasson

eboasson (Contributor) commented:
That hash table stores refcounted "sertopic" objects and the RMW layer caches references to these in its publisher and subscription objects. The absence of rcl_node_fini means the RMW objects are not properly finalised, which in turn keeps the sertopic alive.

They are only ever used for writing a serialized message, so there is something to be said for not caching them at all in the subscription, and cache in the publisher only if they serve a purpose.

rotu (Author) commented Feb 29, 2020

Don’t these sertopics need to be around for as long as the topic may be sent or received, so they do have to be cached for at least the lifetime of the publisher/subscription?

eboasson (Contributor) commented Mar 1, 2020

@rotu, yes, but because they have to be around they are already managed by Cyclone: each topic entity carries a reference to the associated sertopic, ensuring they will remain in existence while there is still data. The refcounting of these things is to allow multiple DDS topic entities to share the same sertopic (this is what relies on those compare and hash functions where you found a few issues with calls to memcmp, &c.)

The issue here is that for publishing serialised messages — rmw_publish_serialized_message — the RMW layer needs it and holds a reference to it. That this reference is counted is what keeps the thing alive outside the Cyclone library and causes the assertion when Cyclone shuts down without first tearing down the node. Note there is not really a need to count it, because it is guaranteed to remain at least for the lifetime of the topic (and thus of the publisher) anyway.

All this does uncover a bug: the avoidance of duplicate sertopics in Cyclone works fine, but it means that following the dds_create_topic_arbitrary call, the sertopic may, but need not, be the one actually used. If it is a duplicate of an existing, registered one, that existing one will be used instead. The consequence is that the current implementation potentially caches the wrong one. Curiously, this discrepancy has no ill effects, but it is a bug.

Fixing this is easy, just use the sertopic that is actually used — but there is currently no way of getting that information … I suppose I will need to add a (trivial) function to retrieve it. E.g.:

const struct ddsi_sertopic *dds_get_topic_sertopic(dds_entity_t topic);

I think it'd be reasonable to make the return value a const and not modify the reference count: all the functions that operate on one (except for "unref") take a const pointer. That pointer can be cached (or the function be called from rmw_publish_serialized_message). I haven't decided on the name yet …

dpotman (Contributor) commented Mar 3, 2020

@rotu I've fixed the sertopic referencing issue, see the two PR references above. Cyclone now takes ownership of the sertopic and returns a pointer to rmw that can be safely used during the lifetime of the topic/writer.

claireyywang commented:

@dennis-adlink @rotu safe to close this issue?

rotu (Author) commented Mar 12, 2020

@claireyywang no, the test still doesn’t exercise the creation and destruction that it’s supposed to. It should.

rotu (Author) commented Mar 12, 2020

Also I’m still not clear whether this is a correct use of the rcl API (deleting a node with live publishers). If so, both use cases should be tested.

sloretz (Contributor) commented Apr 20, 2020

> Also I’m still not clear whether this is a correct use of the rcl API (deleting a node with live publishers). If so, both use cases should be tested.

This part of rcl_node_fini suggests deleting the node with live publishers is a valid use case.

rcl/rcl/include/rcl/node.h

Lines 156 to 157 in 0a795cc

* Any middleware primitives created by the user, e.g. publishers, services, etc.,
* are invalid after deinitialization.

The documentation suggests rcl_publisher_is_valid() will return false after rcl_node_fini() is called, though it's unclear if the publishers still need to be ...fini()'d.

bool
rcl_publisher_is_valid(const rcl_publisher_t * publisher);

The implementation of rcl_node_fini() does not seem to invalidate any entities created from it, so that doesn't match documentation.

rcl/rcl/src/rcl/node.c

Lines 378 to 421 in 0a795cc

rcl_ret_t
rcl_node_fini(rcl_node_t * node)
{
  RCUTILS_LOG_DEBUG_NAMED(ROS_PACKAGE_NAME, "Finalizing node");
  RCL_CHECK_ARGUMENT_FOR_NULL(node, RCL_RET_NODE_INVALID);
  if (!node->impl) {
    // Repeat calls to fini or calling fini on a zero initialized node is ok.
    return RCL_RET_OK;
  }
  rcl_allocator_t allocator = node->impl->options.allocator;
  rcl_ret_t result = RCL_RET_OK;
  rcl_ret_t rcl_ret = RCL_RET_OK;
  if (rcl_logging_rosout_enabled() && node->impl->options.enable_rosout) {
    rcl_ret = rcl_logging_rosout_fini_publisher_for_node(node);
    if (rcl_ret != RCL_RET_OK && rcl_ret != RCL_RET_NOT_INIT) {
      RCL_SET_ERROR_MSG("Unable to fini publisher for node.");
      result = RCL_RET_ERROR;
    }
  }
  rmw_ret_t rmw_ret = rmw_destroy_node(node->impl->rmw_node_handle);
  if (rmw_ret != RMW_RET_OK) {
    RCL_SET_ERROR_MSG(rmw_get_error_string().str);
    result = RCL_RET_ERROR;
  }
  rcl_ret = rcl_guard_condition_fini(node->impl->graph_guard_condition);
  if (rcl_ret != RCL_RET_OK) {
    RCL_SET_ERROR_MSG(rmw_get_error_string().str);
    result = RCL_RET_ERROR;
  }
  allocator.deallocate(node->impl->graph_guard_condition, allocator.state);
  // assuming that allocate and deallocate are ok since they are checked in init
  allocator.deallocate((char *)node->impl->logger_name, allocator.state);
  allocator.deallocate((char *)node->impl->fq_name, allocator.state);
  if (NULL != node->impl->options.arguments.impl) {
    rcl_ret_t ret = rcl_arguments_fini(&(node->impl->options.arguments));
    if (ret != RCL_RET_OK) {
      return ret;
    }
  }
  allocator.deallocate(node->impl, allocator.state);
  node->impl = NULL;
  RCUTILS_LOG_DEBUG_NAMED(ROS_PACKAGE_NAME, "Node finalized");
  return result;
}

And rcl_publisher_fini() will only fini the publisher if the node is still valid

rcl/rcl/src/rcl/publisher.c

Lines 219 to 222 in 0a795cc

  RCL_CHECK_ARGUMENT_FOR_NULL(publisher, RCL_RET_PUBLISHER_INVALID);
  if (!rcl_node_is_valid_except_context(node)) {
    return RCL_RET_NODE_INVALID;  // error already set
  }

In short, it looks like the implementation expects another layer to handle the burden of destroying things in the right order. I'm not sure if rcl expects that burden to be handled by the rmw implementation or the client library, but I think it should be the latter.

I don't think the rmw implementation should be expected to clean up publishers when the node is finalized because that expectation isn't documented on rmw_destroy_node(). There's similarly no documentation on rmw_destroy_publisher(). Also, it would be duplicate work. The client libraries rclcpp and rclpy already try to ensure things are destructed in the right order, so there's no need to repeat it in the lower layers.

So besides the test, I think this is a documentation issue. I would recommend doing all of:

  • Fix the referenced test so it calls ...fini() on the publisher
  • Replace the documentation on rcl_node_fini() saying all publishers/etc become invalid with something saying that all things created from the node must be fini'd first, otherwise the behavior is undefined
  • Document on rmw_destroy_node() that a higher layer will make sure things created from the node were already finalized, so reference counting in the rmw layer is unnecessary
  • Document on the rcl_publisher/etc_fini() calls that they must be called before the node is finalized

sloretz (Contributor) commented Apr 20, 2020

Since the test was fixed already, I'll close this issue and let the documentation PRs stand on their own.

sloretz closed this as completed Apr 20, 2020
rotu (Author) commented Apr 20, 2020

Thanks, @sloretz!

wjwwood (Member) commented Apr 21, 2020

> The documentation suggests rcl_publisher_is_valid() will return false after rcl_node_fini() is called, though it's unclear if the publishers still need to be ...fini()'d.

I think the assumption was that if you fini the node, it would fini any "children" automatically, making any "dangling" pointers to those things, like an rcl_publisher_t, now invalid. But it is definitely preferable to fini things in the reverse order of initialization.
