Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

👨‍🌾 rosbag2_transport Test regressions with cyclonedds + flaky with fastrtps #933

Closed
Blast545 opened this issue Dec 8, 2021 · 13 comments
Assignees
Labels
bug Something isn't working

Comments

@Blast545
Copy link
Contributor

Blast545 commented Dec 8, 2021

Description

New test regressions in the windows buildfarm jobs

nightly_win_rel
Patient zero: https://ci.ros2.org/view/nightly/job/nightly_win_rel/2137/

projectroot.test_record__rmw_cyclonedds_cpp
rosbag2_transport.RecordIntegrationTestFixture.receives_latched_messages

nightly_win_deb

Patient zero according to logs: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2192/
Patient zero according to CI: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2193/

projectroot.test_record__rmw_cyclonedds_cpp
rosbag2_transport.RecordIntegrationTestFixture.qos_is_stored_in_metadata
rosbag2_transport.RecordIntegrationTestFixture.records_sensor_data
rosbag2_transport.RecordIntegrationTestFixture.receives_latched_messages

Expected Behavior

All test pass, build result is SUCCESS.

Actual Behavior

The mentiones tests are failing.

Additional context

I have reasons to believe it's related to #920. I'll open a PR reverting #920 #921 and #930 to see CI results.

Partial error log:

26: [----------] 6 tests from RecordIntegrationTestFixture^M
26: [ RUN      ] RecordIntegrationTestFixture.published_messages_from_multiple_topics_are_recorded^M
26: stdin is not a terminal or conso    qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqle device. Keyboard handling disabled.[INFO] [1638535392.124965000] [rosbag2_recorder]: Press SPACE for pausing/resu\
ming^M
26: [INFO] [1638535392.125188100] [rosbag2_recorder]: Listening for topics...^M
26: [INFO] [1638535392.133262200] [rosbag2_recorder]: Subscribed to topic '/string_topic'^M
26: [INFO] [1638535392.139954900] [rosbag2_recorder]: Subscribed to topic '/array_topic'^M
26: [INFO] [1638535392.149985100] [rosbag2_recorder]: All requested topics are subscribed. Stopping discovery...^M
26: [       OK ] RecordIntegrationTestFixture.published_messages_from_multiple_topics_are_recorded (1000 ms)^M
26: [ RUN      ] RecordIntegrationTestFixture.qos_is_stored_in_metadata^M
26: stdin is not a terminal or console device. Keyboard handling disabled.[INFO] [1638535393.100816600] [rosbag2_recorder]: Press SPACE for pausing/resuming^M
26: [INFO] [1638535393.100995300] [rosbag2_recorder]: Listening for topics...^M
26: [ERROR] [1638535393.101764300] [ROSBAG2_TRANSPORT]: Topic '/chatter' has more than one type associated. Only topics with one type are supported^M
...
26: [ERROR] [1638535403.095305700] [ROSBAG2_TRANSPORT]: Topic '/chatter' has more than one type associated. Only topics with one type are supported^M
26: C:\ci\ws\src\ros2\rosbag2\rosbag2_transport\test\rosbag2_transport\test_record.cpp(107): error: Value of: pub_manager.wait_for_matched(topic.c_str())^M
26:   Actual: false^M
26: Expected: true^M
26: [INFO] [1638535403.208178900] [rosbag2_recorder]: Subscribed to topic '/chatter'^M
26: [INFO] [1638535403.208383000] [rosbag2_recorder]: All requested topics are subscribed. Stopping discovery...^M
26: [  FAILED  ] RecordIntegrationTestFixture.qos_is_stored_in_metadata (10456 ms)^M

The reason of the error seems to be related to the topic chatter having another type associated to it. This doesn't make sense to me, as the chatter publisher seems to be uniquely instantiated here:

pub_manager.setup_publisher(topic, string_message, 2);

@Blast545 Blast545 added the bug Something isn't working label Dec 8, 2021
@emersonknapp
Copy link
Collaborator

emersonknapp commented Dec 8, 2021

This would probably me more closely related to #924 - which implements topic filtering, including "multi-type detection". May be more fruitful to look at that one first.

Based on https://ci.ros2.org/view/nightly/job/nightly_win_deb/2193/#showFailuresLink - it seems that "graceful shutdown" stuff might also be failing? In this case, perhaps there are some orphaned test publishers from other tests, publishing to /chatter, that are of a different type than the one intended in the test? Not sure, just a hypothesis.

It may be worth considering some convention of "universally unique topic name" (e.g. use a UUID as the namespace or something) for test topics, so that there is no possibility of confusion on that particular front.

@Blast545
Copy link
Contributor Author

This would probably me more closely related to #924

@emersonknapp I think it's related to #920 mainly because the timing it was merged and the time the error starting appearing coincides. However, as the test failiure relates to a state maybe left from another test, I'll take a depeer look to see where it comes from.

it seems that "graceful shutdown" stuff might also be failing?

You mean the test failures related to graceful shutdown? Those seem to be errors that are not happening anymore, you can see the latest nightly: https://ci.ros2.org/view/nightly/job/nightly_win_deb/2204/

It may be worth considering some convention of "universally unique topic name" (e.g. use a UUID as the namespace or something) for test topics

I have mixed feelings on that front, maybe if a test requires a complete clean state it should have pre condition test checks or if a particular failure test can leave an unclean test it should test it.

@Blast545 Blast545 changed the title 👨‍🌾 Test egressions in the nightly_win_rel and nightly_win_deb jobs 👨‍🌾 Test regressions in the nightly_win_rel and nightly_win_deb jobs Dec 21, 2021
@Blast545
Copy link
Contributor Author

👨‍🌾 Some jobs I triggered to get info on the cause of the problem:

Windows job, forcing debug log to track others tests using the chatter topic: Build Status
(The error happens, although is not logged in the job summary)

Testing if the error ocurrs when the package is tested without testing other packages:
Build Status
Build Status
Only fails when tested with other packages.

The error only ocurrs with cycloneDDs, I wasn't able to test it only with that DDS without ros2/rmw_implementation#201 merged.

@hidmic hidmic self-assigned this Dec 23, 2021
@hidmic
Copy link
Contributor

hidmic commented Jan 4, 2022

I suspect @emersonknapp's hypothesis is spot on. E.g. there're plenty /chatter topic publishers in test_security tests that are terminated by launch in its current, not very nice manner.

@Blast545
Copy link
Contributor Author

Blast545 commented Jan 5, 2022

@hidmic What do you mean that these are not terminated in a nice manner?

@Blast545
Copy link
Contributor Author

Blast545 commented Jan 6, 2022

Running a Windows CycloneDDS job to get more info:
Build Status

@hidmic
Copy link
Contributor

hidmic commented Jan 6, 2022

What do you mean that these are not terminated in a nice manner?

@Blast545 On Windows, launch (and by extension launch_testing) stops processes by hard terminating them. There's no Unix signals equivalent there. Processes are not allowed to terminate gracefully, which sometimes causes problems. That may be the case here, but it's still a hypothesis. Haven't confirmed it yet.

@Blast545
Copy link
Contributor Author

Blast545 commented Feb 2, 2022

FYI this stopped appearing in the buildfarm after the default RMW was changed. See ros2/rmw#315. @hidmic @clalancette

@Blast545 Blast545 changed the title 👨‍🌾 Test regressions in the nightly_win_rel and nightly_win_deb jobs 👨‍🌾 rosbag2_transport Test regressions with cyclonedds Feb 2, 2022
@Blast545 Blast545 changed the title 👨‍🌾 rosbag2_transport Test regressions with cyclonedds 👨‍🌾 rosbag2_transport Test regressions with cyclonedds + flaky with fastrtps Apr 4, 2022
@Blast545
Copy link
Contributor Author

Blast545 commented Apr 4, 2022

@Blast545
Copy link
Contributor Author

I marked you as assigned and moved this to the "in progress" section of the Humble board mainly so others don't take a look into this right now @clalancette

@MichaelOrlov
Copy link
Contributor

@Blast545 @clalancette Can we close this issue or it is still appear on build farm?

@Blast545
Copy link
Contributor Author

Blast545 commented Feb 1, 2023

@Crola1702 Can you please take a look?

@Crola1702
Copy link
Contributor

Crola1702 commented Feb 1, 2023

Can we close this issue or it is still appear on build farm?

Last time it happened was on Windows release 2022-04-13. I will close this one as it's not happening anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants