
👨‍🌾 Add a potential temporary fix for flaky tests on macOS #807

Closed
azeey wants to merge 2 commits into ign-gazebo5 from fix_flaky_macos

Conversation

@azeey (Contributor) commented May 5, 2021

🦟 Bug fix

Relates to #806

Summary

A potential solution to the problem of flaky tests on macOS is to include all known components in libignition-gazebo.so. I have tested this locally with Address Sanitizer and it seems to work.
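For anyone reproducing the sanitizer run locally, a generic AddressSanitizer build of a CMake project looks roughly like the following; these are the standard ASan flags, not necessarily the exact configuration used here:

```sh
# Generic AddressSanitizer setup for a CMake project; the flags below are
# the usual ASan recipe and may differ from what was actually used here.
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug \
         -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
         -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address"
make -j4
ctest --output-on-failure
```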

Checklist

  • Signed all commits for DCO
  • Added tests
  • Updated documentation (as needed)
  • Updated migration guide (as needed)
  • codecheck passed (See contributing)
  • All tests passed (See test coverage)
  • While waiting for a review on your PR, please help review another open pull request to support the maintainers

Note to maintainers: Remember to use Squash-Merge

@github-actions github-actions bot added the 🏰 citadel Ignition Citadel label May 5, 2021
@chapulina chapulina added the macOS macOS support label May 5, 2021
@chapulina (Contributor)

> include all known components in libignition-gazebo.so

Do you think we could get a similar effect by including them into the test executable? That would solve the test's issue while leaving the core library intact.


I can't see the results from CI, and it looks like 3 tests failed on Homebrew. I'll sync this with ign-gazebo3 to retrigger CI.

@azeey (Contributor, Author) commented Jun 30, 2021

I think some, if not all, of the remaining failures might be fixed by gazebosim/gz-common#227.

@chapulina (Contributor)

So you think they're not related to how the components are registered, as explained in #806?

One thing all these test failures have in common is a segfault at startup, even before any console messages are printed.

@codecov (bot) commented Jun 30, 2021

Codecov Report

Merging #807 (6cb1354) into ign-gazebo5 (b45f60f) will decrease coverage by 0.09%.
The diff coverage is 100.00%.

Impacted file tree graph

@@               Coverage Diff               @@
##           ign-gazebo5     #807      +/-   ##
===============================================
- Coverage        67.04%   66.95%   -0.10%     
===============================================
  Files              276      276              
  Lines            21049    21023      -26     
===============================================
- Hits             14112    14075      -37     
- Misses            6937     6948      +11     
| Impacted Files | Coverage Δ |
|---|---|
| src/SdfEntityCreator.cc | 85.41% <ø> (ø) |
| src/SimulationRunner.cc | 91.88% <100.00%> (-2.05%) ⬇️ |
| src/Barrier.cc | 96.15% <0.00%> (-3.85%) ⬇️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b45f60f...6cb1354. Read the comment docs.

@chapulina (Contributor)

FWIW, there were no Homebrew failures in the latest run 😬

@azeey (Contributor, Author) commented Jul 9, 2021

I've disabled multi-threading in 6ac5110. Let's see what this does.

@chapulina (Contributor)

> I've disabled multi-threading in 6ac5110. Let's see what this does.

It's tough to tell from just one CI run because these failures are very flaky. It would be nice to be able to set something like colcon's --retest-until-fail on CI.
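For reference, running that locally would look something like this (the package name below is illustrative):

```sh
# Re-run the tests of one package until a run fails, up to 20 attempts.
colcon test --packages-select ignition-gazebo3 --retest-until-fail 20
colcon test-result --verbose
```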

@scpeters (Member)

> > I've disabled multi-threading in 6ac5110. Let's see what this does.
>
> It's tough to tell from just one CI run because these failures are very flaky. It would be nice to be able to set something like colcon's --retest-until-fail on CI.

we currently have a script for running ctest with --rerun-failed if the RERUN_FAILED_TESTS variable is set to 1, though we only use it for gazebo CI

we could add something similar for --retest-until-fail
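Something along these lines, as a sketch of the pattern only (the real release-tools script may differ):

```sh
# Sketch of the kind of env-var gating described above; not the actual
# release-tools script.
if [ "${RERUN_FAILED_TESTS:-0}" = "1" ]; then
  ctest --rerun-failed --output-on-failure
fi

# A similar knob for flaky-test hunting could wrap:
#   ctest --repeat-until-fail 10
```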

@azeey (Contributor, Author) commented Jul 12, 2021

> > > I've disabled multi-threading in 6ac5110. Let's see what this does.
> >
> > It's tough to tell from just one CI run because these failures are very flaky. It would be nice to be able to set something like colcon's --retest-until-fail on CI.
>
> we currently have a script for running ctest with --rerun-failed if the RERUN_FAILED_TESTS variable is set to 1, though we only use it for gazebo CI
>
> we could add something similar for --retest-until-fail

Yeah, ctest has --repeat-until-fail. How about doing that on a custom RTOOLS_BRANCH? Is gazebo-tooling/release-tools#242 still an issue?
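For example (test name and iteration count below are illustrative):

```sh
# Run the whole suite repeatedly, stopping as soon as any test fails.
ctest --repeat-until-fail 100

# Or hammer on a single flaky test by name:
ctest --repeat-until-fail 100 -R INTEGRATION_scene_broadcaster_system
```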

@chapulina (Contributor)

> How about doing that on a custom RTOOLS_BRANCH?

+1

> Is gazebo-tooling/release-tools#242 still an issue?

I believe it is, but it's flaky and I haven't seen it in a while.

@azeey (Contributor, Author) commented Jul 12, 2021

Building with --repeat-until-fail=100: Build Status

@azeey (Contributor, Author) commented Jul 13, 2021

https://build.osrfoundation.org/job/ignition_gazebo-ci-pr_any-homebrew-amd64/5997/ ran each test until it failed with a maximum of 100 runs each. The logs show:

The following tests FAILED:
         39 - UNIT_PeerTracker_TEST (Failed)
        107 - INTEGRATION_network_handshake (Subprocess aborted)
        119 - INTEGRATION_scene_broadcaster_system (SEGFAULT)

I suspect the failure in UNIT_PeerTracker_TEST might be a timing/networking issue (#452). The INTEGRATION_network_handshake test uses multiple threads, so it might be a race condition on console stream objects (gazebosim/gz-common#227). I don't have any guesses as to what's causing the segfault in INTEGRATION_scene_broadcaster_system.

For comparison, I'll run ign-gazebo3 with --repeat-until-fail and report back.

@azeey (Contributor, Author) commented Jul 13, 2021

In the interest of time, I only did --repeat-until-fail=10 for the ign-gazebo3 branch: https://build.osrfoundation.org/view/ign-citadel/job/ignition_gazebo-ci-pr_any-homebrew-amd64/6020

The logs show:

The following tests FAILED:
	 39 - UNIT_PeerTracker_TEST (Failed)
	 59 - INTEGRATION_breadcrumbs (SEGFAULT)
	 71 - INTEGRATION_each_new_removed (Subprocess aborted)
	 93 - INTEGRATION_level_manager_runtime_performers (SEGFAULT)
	111 - INTEGRATION_physics_system (SEGFAULT)
	115 - INTEGRATION_pose_publisher_system (SEGFAULT)
	119 - INTEGRATION_scene_broadcaster_system (SEGFAULT)

I'm guessing INTEGRATION_network_handshake would have failed too if I had run it 100 times.

@chapulina (Contributor)

So we're testing two different things at once here: the header includes and the worker threads. I think we should examine them separately.

If it turns out the worker threads are the problem, I believe the next step could be to disable multi-threading just for macOS until we have time to try and address the underlying issue. I'm assuming no one has time to debug these failures on a Mac right now. I think it's worth sacrificing performance for reliability for now (for Mac users and CI).

@azeey (Contributor, Author) commented Feb 25, 2022

Now that ign-common 4.5 has been released, we've had much better Homebrew CI: https://build.osrfoundation.org/job/ignition_gazebo-ci-ign-gazebo5-homebrew-amd64/ (starting from build 53).

[Screenshot: last 20 builds]

INTEGRATION_scene_broadcaster_system segfaulted in build 55 and UNIT_PeerTracker_TEST failed in build 58. I think this is very encouraging. I'll retarget this to ign-gazebo5 for now and run tests to see if the combination of this PR and gazebosim/gz-common#227 shows an improvement.

I have attached the test result CSV file for posterity: Test Results.csv

@azeey azeey changed the base branch from ign-gazebo3 to ign-gazebo5 February 25, 2022 17:31
@chapulina chapulina added 🏢 edifice Ignition Edifice and removed 🏰 citadel Ignition Citadel labels Mar 24, 2022
@chapulina (Contributor)

@azeey, Edifice will EOL next week. Do you want to retarget this to ign-gazebo6?

@azeey (Contributor, Author) commented Mar 25, 2022

> @azeey, Edifice will EOL next week. Do you want to retarget this to ign-gazebo6?

Yup. Will do.

@chapulina chapulina mentioned this pull request Jun 20, 2022
@chapulina chapulina added the bug Something isn't working label Jul 23, 2022
@azeey azeey mentioned this pull request Sep 22, 2022
@azeey (Contributor, Author) commented Dec 22, 2022

Closing in favor of #1836

@azeey azeey closed this Dec 22, 2022
@azeey azeey deleted the fix_flaky_macos branch December 22, 2022 22:54