Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uORB: fix: When multiple publishers publish data to the same node, any publisher cancels the publication and the node becomes invalid #16084

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

shawnfeng0
Copy link
Contributor

@shawnfeng0 shawnfeng0 commented Oct 29, 2020

When multiple publishers use DeviceNode instance 0 to publish data in uORB, any node cancels publishing, which will cause the node to be marked as unadvertised, and the data cannot be obtained by subscribers.

This fix counts the number of publishers to determine whether there are still publishers.

To this end, a test case uORBTest::UnitTest::test_single_unadvertise() has been written to test this situation.

Note: The mismatch between orb_advertise and orb_unadvertise may cause errors. But if you use the C++ interface, this problem will not exist.

This fix is inspired by #16012 (comment)

@shawnfeng0 shawnfeng0 force-pushed the uorb_publisher_count branch from 964df29 to 5b107ca Compare October 29, 2020 07:41
@dagar dagar requested review from dagar and bkueng October 29, 2020 14:03
@dagar
Copy link
Member

dagar commented Oct 29, 2020

Are you going to remove the hack in the destructor as well?

if (_handle != nullptr) {
// don't automatically unadvertise queued publications (eg vehicle_command)
if (static_cast<DeviceNode *>(_handle)->get_queue_size() == 1) {
unadvertise();
}

@dagar
Copy link
Member

dagar commented Oct 29, 2020

Adding the number of publishers to uorb status/top might also be enlightening.

update: 1s, num topics: 65
TOPIC NAME                     INST #SUB RATE #Q SIZE
actuator_armed                    0    8    1  1   24 
actuator_controls_0               0    6  796  1   48 
actuator_outputs                  0    3   50  1   80 
battery_status                    0    8    2  1  120 
commander_state                   0    1    1  1   16 
cpuload                           0    4    2  1   16 
multirotor_motor_limits           0    2   50  1   16 
px4io_status                      0    1    1  1  128 
rate_ctrl_status                  0    1  796  1   24 
safety                            0    2    1  1   16 
sensor_accel                      0    4  796  8   48 
sensor_combined                   0    2  199  1   48 
sensor_gyro                       0    4  796  8   40 
sensor_preflight_mag              0    1   50  1   16 
sensors_status_imu                0    1  199  1   72 
system_power                      0    2  100  1   24 
telemetry_status                  0    2    1  1   56 
telemetry_status                  1    2    1  1   56 
vehicle_acceleration              0    2  199  1   32 
vehicle_angular_acceleration      0    1  796  1   32 
vehicle_angular_velocity          0    6  796  1   32 
vehicle_control_mode              0   14    1  1   32 
vehicle_imu                       0    5  199  1   56 
vehicle_imu_status                0    4   10  1   88 
vehicle_land_detected             0    7    1  1   24 
vehicle_status                    0   22    1  1   56 
vehicle_status_flags              0    1    1  1   40 

PX4_INFO_RAW(CLEAR_LINE "%-*s INST #SUB RATE #Q SIZE\n", (int)max_topic_name_length - 2, "TOPIC NAME");

@shawnfeng0
Copy link
Contributor Author

Are you going to remove the hack in the destructor as well?

if (_handle != nullptr) {
// don't automatically unadvertise queued publications (eg vehicle_command)
if (static_cast<DeviceNode *>(_handle)->get_queue_size() == 1) {
unadvertise();
}

Yes, it will be done in another pull request soon.

@shawnfeng0 shawnfeng0 force-pushed the uorb_publisher_count branch from 5b107ca to be57bce Compare October 29, 2020 16:04
@shawnfeng0 shawnfeng0 changed the title fix: When multiple publishers publish data to the same node, any publisher cancels the publication and the node becomes invalid uORB: fix: When multiple publishers publish data to the same node, any publisher cancels the publication and the node becomes invalid Oct 29, 2020
@shawnfeng0 shawnfeng0 marked this pull request as ready for review October 29, 2020 16:05
@shawnfeng0
Copy link
Contributor Author

Adding the number of publishers to uorb status/top might also be enlightening.
...

PX4_INFO_RAW(CLEAR_LINE "%-*s INST #SUB RATE #Q SIZE\n", (int)max_topic_name_length - 2, "TOPIC NAME");

Done.

src/modules/uORB/uORBDeviceNode.hpp Outdated Show resolved Hide resolved
src/modules/uORB/uORBDeviceMaster.cpp Outdated Show resolved Hide resolved
@shawnfeng0
Copy link
Contributor Author

shawnfeng0 commented Nov 3, 2020

This is really a complicated question.
I have an idea, I am not sure if it is right, I think I can discuss it here first. Is it possible to make uORB a relatively independent module?

  • Does not use the px4_open() function, nor does it use something similar to the file system, and DeviceNode is created and obtained through DeviceMaster.
  • Redefine an orb_poll(), just like px4_poll of POSIX platform, we can implement it separately.
  • Use orb_subscription_t to replace the subscriber's fd

In this way, uORB can delete many layers. We only need to build a framework on the POSIX interface, which can be used on nuttx and linux.

If we can do this, uORB will be easier to modify, and the architecture will be simpler and clearer. It will also reduce some memory and flash usage.

@dagar
Copy link
Member

dagar commented Nov 3, 2020

This is really a complicated question.
I have an idea, I am not sure if it is right, I think I can discuss it here first. Is it possible to make uORB a relatively independent module?

  • Does not use the px4_open() function, nor does it use something similar to the file system, and DeviceNode is created and obtained through DeviceMaster.
  • Redefine an orb_poll(), just like px4_poll of POSIX platform, we can implement it separately.
  • Use orb_subscription_t to replace the subscriber's fd

In this way, uORB can delete many layers. We only need to build a framework on the POSIX interface, which can be used on nuttx and linux.

If we can do this, uORB will be easier to modify, and the architecture will be simpler and clearer. It will also reduce some memory and flash usage.

We're actually pretty close to not needing the underlying file descriptors at all with the vast majority of subscriptions in the system using uORB::Subscription. Some background #11176.

@dagar
Copy link
Member

dagar commented Nov 3, 2020

Can you rebase on master to get the uORB test changes? You could then uncomment line 830 so we can ensure the failure discussed in #16093 is resolved.

//return test_fail("sub %d generation should be %d + 1, but it's %d", sub_instance, last_gen, sub.get_last_generation());
PX4_ERR("sub %d generation should be %d + 1, but it's %d", sub_instance, last_gen, sub.get_last_generation());

@shawnfeng0 shawnfeng0 marked this pull request as draft November 4, 2020 06:45
@shawnfeng0 shawnfeng0 force-pushed the uorb_publisher_count branch 2 times, most recently from c0ddc04 to 13d26c0 Compare November 12, 2020 19:06
@shawnfeng0 shawnfeng0 marked this pull request as ready for review November 12, 2020 19:06
@shawnfeng0 shawnfeng0 requested a review from bkueng November 12, 2020 19:18
@shawnfeng0
Copy link
Contributor Author

shawnfeng0 commented Nov 12, 2020

I thought of the solution. Unified increase the number of publishers in _device_master->advertise().

@shawnfeng0 shawnfeng0 force-pushed the uorb_publisher_count branch 3 times, most recently from 37b882a to 6069066 Compare January 10, 2021 19:28
@shawnfeng0
Copy link
Contributor Author

In uORB::DeviceMaster::advertise, first check whether there is an existing node, if not, then create the node, and register to the file system through node->init().

Another advantage of this is that the creation is performed only when it is determined that the node needs to be created, instead of using new to create it and then determining whether it exists.

@shawnfeng0
Copy link
Contributor Author

shawnfeng0 commented Jan 10, 2021

Two months ago, I was busy with other things. I reviewed my commit and made some changes.

@shawnfeng0
Copy link
Contributor Author

Please review this pull request. @bkueng

@bkueng
Copy link
Member

bkueng commented Jan 14, 2021

Yes I will

Copy link
Member

@bkueng bkueng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks correct, I remarked a couple of details that need to be updated.

src/modules/uORB/uORBDeviceNode.cpp Outdated Show resolved Hide resolved
src/modules/uORB/uORBDeviceNode.hpp Outdated Show resolved Hide resolved
src/modules/uORB/uORBDeviceMaster.cpp Outdated Show resolved Hide resolved
src/modules/uORB/uORBDeviceMaster.cpp Outdated Show resolved Hide resolved
src/modules/uORB/uORBDeviceMaster.cpp Outdated Show resolved Hide resolved
src/modules/uORB/uORBDeviceNode.hpp Outdated Show resolved Hide resolved
@shawnfeng0 shawnfeng0 marked this pull request as draft January 20, 2021 16:22
@shawnfeng0 shawnfeng0 force-pushed the uorb_publisher_count branch from 6327d01 to 5bc66d4 Compare January 20, 2021 16:45
@shawnfeng0 shawnfeng0 marked this pull request as ready for review January 20, 2021 16:45
@shawnfeng0
Copy link
Contributor Author

The code looks correct, I remarked a couple of details that need to be updated.

Thank you. It has been changed and rebase to the latest master branch.

@bkueng
Copy link
Member

bkueng commented Jan 21, 2021

Thanks. A remaining problem is that the pub counter keeps increasing for some topics, like vehicle_command, and will eventually overflow. We need to unadvertise all, and we can keep a list of topics that we always want to have advertised.

CI failure is unrelated:

            conf:          0 GB        64 KB      0.00/opt/gcc/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/bin/ld: spracing_h7extreme_default.elf section `.itcm_code' will not fit in region `itcm'
/opt/gcc/bin/../lib/gcc/arm-none-eabi/9.3.1/../../../../arm-none-eabi/bin/ld: region `itcm' overflowed by 312 bytes
collect2: error: ld returned 1 exit status

@mrpollo
Copy link
Contributor

mrpollo commented Feb 15, 2021

@bkueng what is needed to get this PR in?

@bkueng
Copy link
Member

bkueng commented Feb 16, 2021

See my latest comment. A more minimal change would be to clamp the counter to avoid overflow.

@shawnfeng0
Copy link
Contributor Author

I think there may not be a situation where publisher_count_ keeps increasing, as long as publishing and unpublishing appear in pairs, but I have not had time to test a lot. Sorry, I will test it soon.

@stale
Copy link

stale bot commented Jun 26, 2021

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the stale label Jun 26, 2021
@lukegluke
Copy link
Contributor

Any progress here?
Encountered on memory leak on stopping/starting custom module that use PublicationMulti for custom topic, due to no automatic unadvertising. As I see it could be also related to PX4 modules/drivers too.

I agree that publisher_count should not keeps increasing if the hack in the PublicationBase destructor will be removed and there will be unadvertising on each advertising and of course if number of active publishers is less that coutner size.

@stale stale bot removed the stale label Jun 3, 2022
@shawnfeng0
Copy link
Contributor Author

Sorry, I've been busy with my other work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants