Possible memory management issues in DART BoxedLcpConstrainedSolver? #48

Closed
osrf-migration opened this issue Feb 5, 2020 · 5 comments
Labels: bug (Something isn't working), physics (Involves Ignition Physics)

Comments

@osrf-migration

Original report (archived issue) by Jaldert Rombouts (Bitbucket: Jaldert).

The original report had attachments: conveyor_with_box_crash.gif, conveyor_with_box_experiments.tgz


Prerequisites

  • Put an X between the brackets on this line if you have done all of the following:

Description

Thanks for building Ignition Gazebo. I've been experimenting with it and I really like the new architecture.

One of my experiments was building a conveyor model and system plugin. The model consists of a box with a set of rollers (cylinders) with revolute joints on top. I've written a custom system plugin to set JointVelocityCmds for all these rollers.

The model runs fine when it is alone in the world - I've left it running for hours without issues.

However, I’m running into issues when I drop a box on top of the running conveyor: I quickly get memory errors like "free(): invalid next size (fast)" and occasionally a segmentation fault or "double free or corruption (out)". Interestingly, the crash seems to require both the conveyor running and the box moving along it. I tried variations in which the conveyor is not running and the box is dropped or placed at poses similar to those where the simulation crashes, and I could not trigger a crash that way.

Taking the GDB backtrace at face value, this appears to be an issue in DART. In particular, the crash seems to happen in constraint::BoxedLcpConstraintSolver::solveConstrainedGroup(dart::constraint::ConstrainedGroup&) () from /usr/lib/x86_64-linux-gnu/libdart.so.6.10.


(see attachment for high resolution version).

I was not entirely sure if this issue should be posted here or on ign-physics. I’ve posted it here because reproduction requires ign-gazebo, but I’m also happy to open the issue elsewhere.

Steps to Reproduce

  1. Build the RevoluteConveyorController.so library
# Gist with plugin system, and example world SDF
git clone https://gist.github.com/84eaf1c8006e80a4afa30858b296f681.git example

cd example
cmake -H. -Bbuild
cmake --build build

# This step copies the plugin to ~/.ignition/gazebo/plugins so that the dynamic loader can find it.
cmake --build build --target install

  2. Run the conveyor_with_box.sdf example

ign gazebo conveyor_with_box.sdf # with gui
# ign gazebo -s -r conveyor_with_box.sdf # headless

Expected behavior:

The box should drop onto the rollers, move along it and finally fall off the end. The simulation should keep running.

Actual behavior:

The box drops and starts moving along the conveyor, then Ignition Gazebo crashes with memory errors like "free(): invalid next size (fast)" and occasionally a segmentation fault or "double free or corruption (out)".

Reproduces how often:

Always. There is variation in the exact timing to crash (looking at the number of simulation iterations) as well as the exact error (see detailed description above).
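
To quantify that variation across runs, one option is a small harness that launches the simulator repeatedly and records how each run ended. The sketch below is an illustration only and not part of the original report: classify_run and DEMO_CMD are hypothetical names, it relies on the POSIX convention that Python reports a child killed by a signal as a negative return code, and it treats a timeout as "still running" because a healthy simulation is expected not to exit. glibc heap-corruption aborts such as the ones above normally surface as SIGABRT, and a segmentation fault as SIGSEGV.

```python
import signal
import subprocess
import sys

# Stand-in child process that aborts, so the sketch can be tried without
# Ignition installed (hypothetical; replace with the real simulator command).
DEMO_CMD = [sys.executable, "-c", "import os; os.abort()"]


def classify_run(cmd, timeout_s):
    """Run cmd once and return a short label describing how it ended."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; a simulation that
        # survived the whole time budget counts as healthy here.
        return "still running"
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the fatal signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exited with code %d" % proc.returncode
```

To try it on the reproduction, pass the headless command from step 2, for example classify_run(["ign", "gazebo", "-s", "-r", "conveyor_with_box.sdf"], timeout_s=60.0). The 60-second budget is an arbitrary choice.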

Versions

  • OS: Ubuntu 18.04 Bionic
  • Ignition Version: Clean build of all Ignition Citadel packages, following install_ubuntu_src.

Additional Information

  • Virtual machine: No, running native.

@osrf-migration

Original comment by Addisu Z. Taddese (Bitbucket: azeey, GitHub: azeey).


Thanks Jaldert Rombouts (Jaldert) for the reproducible example. I looked at your .sdf file to see if anything was not set up properly before delving into DART. I found that doing the following prevents the crash from happening:

  1. Remove the fixed joint between world and the base_link of the conveyor model
  2. Move the conveyor vertically so that it's not in collision with the ground_plane
  3. Add correct <inertial> parameters to all of the links in the conveyor model

Here is the new file: https://gist.github.com/azeey/cad3bb49f0e096f1e90b8f9adfd3ce4b

And this is the result:

Peek 2020-03-24 22-35.gif

That being said, I don't think ign-gazebo should be crashing given the .sdf file you provided, so we'll keep this issue open and investigate the cause of the crash.

@osrf-migration

Original comment by Michael Grey (Bitbucket: mxgrey, GitHub: mxgrey).


If there’s a collision constraint violation between static objects (i.e. a fundamentally unsolvable scenario), I’m not especially surprised that a catastrophic crash is occurring. I can imagine that kind of scenario may produce NaN values in the simulation state, and those NaN values may spill out to other parts of the program, eventually causing some kind of undefined behavior or unhandled exception.

It would be preferable for the simulation to kindly explain what the problem is, but it’s possible that the amount of sanity checks needed throughout the simulation pipeline to figure that out would be prohibitively expensive. Maybe dartsim could have more assertions to check for NaN values.
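
As a rough illustration of the kind of assertion meant here, the sketch below checks a state vector for non-finite entries and fails early with a readable message. It is written in Python purely for brevity; DART is a C++ library and nothing here is its API. The names first_non_finite and check_state are hypothetical.

```python
import math


def first_non_finite(values):
    """Return the index of the first NaN or infinite entry, or None."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None


def check_state(name, values):
    """Raise a descriptive error instead of letting NaN values propagate."""
    bad = first_non_finite(values)
    if bad is not None:
        raise ValueError(f"{name}[{bad}] is not finite: {values[bad]!r}")
```

A check like this costs one pass over the vector each time it is called, which is the overhead concern raised above, so it would probably be enabled only in debug builds or at a few key points such as right after the constraint solve.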

@osrf-migration

Original comment by Jaldert Rombouts (Bitbucket: Jaldert).


@azeey

Thank you very much for looking into this and for your helpful response! I can confirm that your suggestions fix the issue for the model I shared.

Now I'm trying to understand exactly what parameters are relevant and why, so that I can avoid similar mistakes in future models.

To do this systematically, I've instrumented a templated model where I'm varying the following:

  • Number of “rollers” (9, 10, 11).
  • Adding fixed joint (True, False).
  • Adding inertial (True, False). I've assumed uniform density, 10,000 kg/m^3. This gives higher numbers than yours, but the model doesn’t seem too sensitive to scaling these.
  • z_offset (0.0, 0.025, 0.05). With 0.0 the model should be touching the ground exactly.
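
For reference, the uniform-density assumption above can be turned into mass and principal moments of inertia with the standard formulas for a solid box and a solid cylinder. This is a minimal sketch, not the generator used for the experiments: the function names are hypothetical, the cylinder axis is assumed to be the local z axis, and no link dimensions from the actual model are assumed.

```python
import math

DENSITY = 10_000.0  # kg/m^3, the uniform density stated in the list above


def box_inertial(x, y, z, density=DENSITY):
    """Mass and (ixx, iyy, izz) of a solid box with side lengths in metres."""
    mass = density * x * y * z
    ixx = mass * (y * y + z * z) / 12.0
    iyy = mass * (x * x + z * z) / 12.0
    izz = mass * (x * x + y * y) / 12.0
    return mass, (ixx, iyy, izz)


def cylinder_inertial(radius, length, density=DENSITY):
    """Mass and (ixx, iyy, izz) of a solid cylinder whose axis is local z."""
    mass = density * math.pi * radius * radius * length
    ixx = mass * (3.0 * radius * radius + length * length) / 12.0
    izz = mass * radius * radius / 2.0
    return mass, (ixx, ixx, izz)
```

The returned values map onto the mass, ixx, iyy and izz entries of an SDF inertial element; the off-diagonal terms are zero for these shapes when the inertial frame is aligned with the shape's axes of symmetry.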

Your recommended configuration is:

  • fixed_joint : False
  • inertial : True
  • z_offset : 0.0
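
The parameter grid described above can be enumerated with a few lines of Python. This is a sketch of the idea rather than the actual generator; the names all_configs, RECOMMENDED and matches_recommended are hypothetical.

```python
from itertools import product

N_ROLLERS = (9, 10, 11)
FIXED_JOINT = (True, False)
INERTIAL = (True, False)
Z_OFFSET = (0.0, 0.025, 0.05)

# The recommended settings listed above; n_rollers is left free.
RECOMMENDED = {"fixed_joint": False, "inertial": True, "z_offset": 0.0}


def all_configs():
    """Return every combination of the four varied parameters as a dict."""
    return [
        {"n_rollers": n, "fixed_joint": f, "inertial": i, "z_offset": z}
        for n, f, i, z in product(N_ROLLERS, FIXED_JOINT, INERTIAL, Z_OFFSET)
    ]


def matches_recommended(config):
    """True if config agrees with the recommended settings."""
    return all(config[key] == value for key, value in RECOMMENDED.items())
```

Each dict can then be fed to a template renderer to produce one SDF file per configuration.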

Running over all 36 (3 x 2 x 2 x 3) combinations yields the following table (successful runs are the rows with success = True):

n_rollers  fixed_joint  inertial  z_offset  success  comment
9          True         False     0.0       False    (1)
10         True         False     0.0       False    (1)
11         True         False     0.0       False    (1)
9          False        False     0.0       True
10         False        False     0.0       True
11         False        False     0.0       False    (2)
9          True         True      0.0       False    (3)
10         True         True      0.0       False    (2)
11         True         True      0.0       False    (1)
9          False        True      0.0       True
10         False        True      0.0       True     (4)
11         False        True      0.0       False    (2)
9          True         False     0.025     False    (1)
10         True         False     0.025     False    (1)
11         True         False     0.025     False    (1)
9          False        False     0.025     False    (2)
10         False        False     0.025     False    (2)
11         False        False     0.025     False    (2)
9          True         True      0.025     False    (3)
10         True         True      0.025     False    (2)
11         True         True      0.025     False    (3)
9          False        True      0.025     True
10         False        True      0.025     True
11         False        True      0.025     False    (2)
9          True         False     0.05      False    (1)
10         True         False     0.05      False    (1)
11         True         False     0.05      False    (1)
9          False        False     0.05      False    (2)
10         False        False     0.05      False    (2)
11         False        False     0.05      False    (2)
9          True         True      0.05      False    (2)
10         True         True      0.05      False    (2)
11         True         True      0.05      False    (1)
9          False        True      0.05      True
10         False        True      0.05      True
11         False        True      0.05      False    (2)

Footnotes:

(1) Fails after a few seconds.

(2) Fails almost immediately.

(3) Fails when, or shortly after, the box hits.

(4) Recommended configuration for the original model.

Conclusions:

  • Adding a fixed joint always causes failure.
  • Inertial is only relevant when the model starts with z_offset > 0.0 (i.e. it first needs to drop). This makes sense to me, assuming Gazebo does not automatically derive any inertial parameters.
  • Pushing up n_rollers by one always causes failure. Decreasing it never causes failure when all other parameters are equal to a "good" configuration. (This is for the cases I’ve tested, not exhaustive.)
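
These conclusions can be checked mechanically against the table. The sketch below encodes only the eight rows reported as successful (every other combination failed) and tests each claim; the function names are hypothetical.

```python
from itertools import product

# (n_rollers, fixed_joint, inertial, z_offset) for every row with success = True.
SUCCESSES = {
    (9, False, False, 0.0), (10, False, False, 0.0),
    (9, False, True, 0.0), (10, False, True, 0.0),
    (9, False, True, 0.025), (10, False, True, 0.025),
    (9, False, True, 0.05), (10, False, True, 0.05),
}

GRID = list(product((9, 10, 11), (True, False), (True, False), (0.0, 0.025, 0.05)))


def fixed_joint_always_fails():
    """No configuration with a fixed joint succeeded."""
    return not any(cfg in SUCCESSES for cfg in GRID if cfg[1])


def eleven_rollers_always_fail():
    """No configuration with 11 rollers succeeded."""
    return not any(cfg in SUCCESSES for cfg in GRID if cfg[0] == 11)


def inertial_matters_only_when_dropping():
    """Without a fixed joint and with fewer than 11 rollers, runs at
    z_offset 0.0 succeed either way; above 0.0 they succeed only with
    inertial parameters."""
    for n, fixed, inertial, z in GRID:
        if fixed or n == 11:
            continue
        expected = True if z == 0.0 else inertial
        if ((n, fixed, inertial, z) in SUCCESSES) != expected:
            return False
    return True


def fewer_rollers_never_hurts():
    """Every success with 10 rollers is also a success with 9."""
    return all((9, f, i, z) in SUCCESSES for n, f, i, z in SUCCESSES if n == 10)
```

All four checks hold for the data in the table, which covers only the tested values and says nothing about other roller counts or offsets.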

This gives rise to some follow-up questions:

  • Is there anything else wrong with the SDF model that would explain why upping the number of rollers would cause it to crash? You can find the base template for just the conveyor model here (Jinja2 format, similar to ERB, but easier for me to script) since that is probably a lot easier to eyeball. If you're interested, I'm also happy to share the generator that plugs values into the template and renders them.
  • Why does adding a fixed joint always cause a crash for this model? In particular, I’m using fixed joints for other models seemingly without issues.
  • How did you come up with the fixes to the SDF? Are there guides that can help expand my understanding, or is this purely based on experience?

@mxgrey

Your hypothesis sounded valid to me, so I also ran the configurations with a z-offset of -5cm with the expectation that the model would always crash. However, that is not what I observed. In fact, in this case, inertial can be left unspecified, and thus only fixed_joint and n_rollers affect the success/failure of running the world. This shows that it is possible for a simulation with fundamentally unsolvable collision constraints to run fine.

From a user perspective, auto-checking for common violations and auto-generating (missing) parameters such as inertial matrices could make modelling more fool-proof. For example, (Py)Bullet by default recomputes the inertial matrices when loading from URDF/SDF (see the "URDF_USE_INERTIA_FROM_FILE" flag of loadURDF). I'm guessing this was done because many model files contain mistakes in the inertial parameters. The downside of this is that those matrices might be wrong in more subtle ways (e.g. for non-uniform mass distribution) that keep the simulation running but produce unrealistic results. Generating some output on exactly what gets auto-computed could be a solution to that.
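
As a small example of such an auto-check, the sketch below scans an SDF document and reports links of non-static models that have no inertial element. It is a hypothetical lint script built on Python's standard XML parser, not an existing Ignition or SDFormat tool; the function name and the example world (including the roller_0 link name) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal world used only to demonstrate the check.
EXAMPLE_SDF = """
<sdf version="1.6">
  <world name="default">
    <model name="ground_plane">
      <static>true</static>
      <link name="link"/>
    </model>
    <model name="conveyor">
      <link name="base_link">
        <inertial><mass>1.0</mass></inertial>
      </link>
      <link name="roller_0"/>
    </model>
  </world>
</sdf>
"""


def links_missing_inertial(sdf_text):
    """Return (model name, link name) pairs that lack an <inertial> element.

    Static models are skipped because they are not simulated dynamically.
    """
    root = ET.fromstring(sdf_text)
    missing = []
    for model in root.iter("model"):
        static = model.findtext("static", "false").strip().lower()
        if static in ("true", "1"):
            continue
        for link in model.findall("link"):
            if link.find("inertial") is None:
                missing.append((model.get("name"), link.get("name")))
    return missing
```

Printing the result at load time would give the kind of output suggested above, telling the user which links fall back to default or auto-computed inertial values.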

I have attached conveyor_with_box_experiments.tgz in a separate comment (couldn’t find a way to attach it to this comment). This archive contains all SDFs for the table.

I'm happy to provide any further information!

@osrf-migration

Original comment by Jaldert Rombouts (Bitbucket: Jaldert).


  • set attachment to "conveyor_with_box_experiments.tgz"

SDFs for running experiments in table (as well as negative z-offsets not shown in table).

@osrf-migration added the major and bug labels on Apr 15, 2020
@chapulina added the physics label and removed the major label on Apr 29, 2020
@azeey

azeey commented Jul 7, 2020

I believe this has been fixed by gazebo-forks/dart#6, which is the fork of DART used by ign-gazebo. I ran all the SDF files in conveyor_with_box_experiments.tgz without any failures. Thank you for creating a systematic set of configurations for testing this.

@azeey closed this as completed on Mar 5, 2021