Simplify configs #550

Merged
@aliberts merged 101 commits into main from user/aliberts/2024_11_30_remove_hydra on Jan 31, 2025
Conversation

@aliberts (Collaborator) commented on Dec 5, 2024

What this does

This PR removes Hydra in favor of Draccus.

This brings significant changes to the codebase regarding how configurations are built, saved, loaded and used. Most previously used commands won't work anymore, but hopefully only minimal changes will be needed to make most of them work again.

Overview

Configurations are now defined in code through dataclasses rather than in yaml files. The two main configuration classes are TrainPipelineConfig and EvalPipelineConfig. As with the yaml files previously, their code is heavily commented and is meant to be read in order to understand the options and see the default values.

Reading the updated examples/4_train_policy_with_script.md is a great way to get an overview of how this new config system works.
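
To give an idea of the shape of these classes, here is a heavily simplified sketch; the fields, defaults, and validation shown are illustrative, not the actual lerobot definitions.

from dataclasses import dataclass, field
from typing import Optional

import draccus


@dataclass
class DatasetConfig:
    repo_id: str = "lerobot/pusht"
    episodes: Optional[list[int]] = None


@dataclass
class TrainPipelineConfig:
    dataset: DatasetConfig = field(default_factory=DatasetConfig)
    output_dir: str = "outputs/train/default"
    batch_size: int = 8
    seed: Optional[int] = None

    def __post_init__(self):
        # Validation that previously lived in the scripts can now sit here.
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")


@draccus.wrap()
def train(cfg: TrainPipelineConfig):
    # Every field becomes a --dotted.option on the command line, e.g.:
    #   python train_sketch.py --dataset.repo_id=lerobot/pusht --batch_size=64
    print(cfg)


if __name__ == "__main__":
    train()

Because the options are plain dataclass fields, IDEs can autocomplete them and type-check their values.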

We've updated the following scripts, for which previously used commands will no longer work:

  • lerobot/scripts/train.py
  • lerobot/scripts/eval.py
  • lerobot/scripts/control_robot.py
  • lerobot/scripts/visualize_image_transforms.py

Here are a few examples of commands before/after the changes.

Training Diffusion Policy on PushT - before
python lerobot/scripts/train.py \
    hydra.run.dir=outputs/train/diffusion_pusht \
    policy=diffusion \
    dataset_repo_id=lerobot/pusht \
    env=pusht \
    training.offline_steps=200000 \
    training.save_freq=20000 \
    training.eval_freq=20000 \
    eval.n_episodes=50 \
    wandb.enable=true \
    device=cuda
Training Diffusion Policy on PushT - after
python lerobot/scripts/train.py \
  --output_dir=outputs/train/diffusion_pusht \
  --policy.type=diffusion \
  --dataset.repo_id=lerobot/pusht \
  --env.type=pusht \
  --seed=100000 \
  --batch_size=64 \
  --offline.steps=200000 \
  --eval_freq=20000 \
  --save_freq=20000 \
  --wandb.enable=true \
  --device=cuda

A few things to note:

  • Some options were not present before and must be explicitly passed now. For example, --batch_size=64. This is because with the previous system, the batch_size value was included in the diffusion.yaml, which was implicitly selected with policy=diffusion. Now, the default batch_size is 8 and is independent of policy selection. Same idea for --seed here.
  • To select a policy or an environment, we now use the special argument .type (a minimal sketch of how this works follows this list). Read more about this here.
  • All options now include a -- prefix.
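
For the curious, here is a minimal sketch of how this kind of .type selection typically works with Draccus' ChoiceRegistry; the class names and fields below are illustrative rather than the actual lerobot definitions.

from dataclasses import dataclass, field

import draccus


@dataclass
class PolicyConfig(draccus.ChoiceRegistry):
    """Base config: concrete policies register themselves under a name."""


@PolicyConfig.register_subclass("diffusion")
@dataclass
class DiffusionConfig(PolicyConfig):
    horizon: int = 16
    num_inference_steps: int = 100


@dataclass
class TrainConfig:
    policy: PolicyConfig = field(default_factory=DiffusionConfig)
    batch_size: int = 8


if __name__ == "__main__":
    # e.g. python sketch.py --policy.type=diffusion --policy.horizon=32 --batch_size=64
    cfg = draccus.parse(TrainConfig)
    print(cfg)
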
Evaluating ACT on Aloha Transfer Cube - before
python lerobot/scripts/eval.py \
  -p lerobot/act_aloha_sim_transfer_cube_human \
  eval.n_episodes=10 \
  eval.batch_size=10
Evaluating ACT on Aloha Transfer Cube - after
python lerobot/scripts/eval.py \
  --policy.path=lerobot/act_aloha_sim_transfer_cube_human \
  --env.type=aloha \
  --env.task=TransferCube-v0 \
  --eval.n_episodes=10 \
  --eval.batch_size=10
Running inference of a pretrained model on a SO-100 robot - before
python lerobot/scripts/control_robot.py record \
  --robot-path lerobot/configs/robot/so100.yaml \
  --fps 30 \
  --repo-id ${HF_USER}/eval_act_so100_lego \
  --single-task "Grasp a lego block and put it in the bin." \
  --tags tutorial \
  --warmup-time-s 1 \
  --episode-time-s 30 \
  --reset-time-s 30 \
  --num-episodes 10 \
  --push-to-hub 1 \
  -p outputs/train/act_so100_lego/checkpoints/last/pretrained_model
Running inference of a pretrained model on a SO-100 robot - after
python lerobot/scripts/control_robot.py \
  --robot.type=so100 \
  --control.type=record \
  --control.fps=30 \
  --control.repo_id=${HF_USER}/eval_act_so100_lego \
  --control.single_task="Grasp a lego block and put it in the bin." \
  --control.tags='["tutorial"]' \
  --control.warmup_time_s=1 \
  --control.episode_time_s=30 \
  --control.reset_time_s=30 \
  --control.num_episodes=10 \
  --control.push_to_hub=true \
  --control.policy.path=outputs/train/act_so100_lego/checkpoints/last/pretrained_model

Note that we didn't update the argument parsing for the following scripts, as we didn't feel it was necessary. They are mostly unaffected by these changes, and the commands that worked before should still work with them:

  • lerobot/scripts/visualize_dataset.py
  • lerobot/scripts/visualize_dataset_html.py
  • lerobot/common/robot_devices/cameras/intelrealsense.py
  • lerobot/common/robot_devices/cameras/opencv.py
  • lerobot/scripts/configure_motor.py

Motivation

Our previous system for configurations had several limitations:

  • There was no dynamic link between the features of a dataset or an environment and a policy. This meant that whenever you needed to train on a different set of features from those hardcoded in the config, you needed to hack the config files in order to do so, which was confusing, cumbersome and error prone. In fact, we had to write a whole tutorial on how to do that.
  • Having configurations entirely defined in yaml files means that their deserialization can sometimes lead to unpredictable behavior, or make configuration errors harder to spot. Moreover, the namespaces/dictionaries returned have very little IDE support (autocomplete, jump-to-definition, etc.).
  • While Hydra composition can be a powerful feature, it does come with a lot of complexity and the learning curve can be steep.
  • The overall goal is to simplify the workflow across the different use cases (training, evaluation, recording, etc.) and make the scripts easier to use.

Changes

  • Adds config dataclasses for the scripts.
  • Moves a lot of the config validation logic that was previously in the scripts into the __post_init__ of these classes.
  • Adds a custom @parser.wrap() decorator, similar to @draccus.wrap(), that preprocesses command-line arguments to enable .path arguments (for policies only for now, e.g. --policy.path).
  • Replaces all Hydra function calls with their Draccus / custom wrapper / direct config instantiation counterparts:
# wrapper
- @hydra.main(version_base="1.2", config_name="default", config_path="../configs")
+ @parser.wrap()

# parser
- cfg = init_hydra_config(hydra_cfg_path, config_overrides)
+ cfg = draccus.parse(TrainPipelineConfig, config_path=config_path, args=cli_args)

# direct class instantiation
- cfg = init_hydra_config()
+ cfg = TrainPipelineConfig()
  • Adds HubMixin: A custom implementation of huggingface_hub.ModelHubMixin to better fit our needs (mostly, being able to serialize/deserialize using Draccus).
  • Adds PreTrainedConfig, which policy config classes inherit from. Inspired by transformers.PretrainedConfig, this class now manages a few things common to policy configs and harmonizes their interface.
  • Similarly, adds PreTrainedPolicy, which policy classes inherit from.
  • Removes the Policy Protocol in favor of directly using PreTrainedPolicy.
  • Links the input/output shapes of policies to datasets and environments:
    • parse_features_from_dataset
    • parse_features_from_env
  • Harmonizes optimizer and scheduler configs with OptimizerConfig and LRSchedulerConfig, which create (through their build method) standard optimizers and schedulers from torch.optim or diffusers.optimization whenever possible, and custom implementations when needed (see the sketch after this list).
  • Adds a use_policy_training_preset option (true by default) in the training config to allow selecting an optimizer/scheduler preset that comes with each policy config. Each PreTrainedPolicy also implements get_optim_params(), which returns a dict of parameters specific to that policy to be used by the optimizer (only used when use_policy_training_preset is true). This addresses the issues discussed in Move function make_optimizer_and_scheduler to policy #401
  • The last symlink in checkpoints now points to the last checkpoint with a relative path (it was absolute before), which makes it easier to move things around.
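
As a rough illustration of the OptimizerConfig / build pattern mentioned above; the names, fields, and defaults below are illustrative, and the actual lerobot classes may differ.

from dataclasses import dataclass

import torch


@dataclass
class OptimizerConfig:
    lr: float = 1e-4
    weight_decay: float = 0.0

    def build(self, params) -> torch.optim.Optimizer:
        raise NotImplementedError


@dataclass
class AdamWConfig(OptimizerConfig):
    betas: tuple[float, float] = (0.9, 0.999)

    def build(self, params) -> torch.optim.Optimizer:
        # Builds a standard optimizer from torch.optim using the config values.
        return torch.optim.AdamW(
            params, lr=self.lr, betas=self.betas, weight_decay=self.weight_decay
        )


# With use_policy_training_preset=true, the parameters typically come from the
# policy's own get_optim_params(), e.g.:
#   optimizer = AdamWConfig(lr=1e-5).build(policy.get_optim_params())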

TODO in future PR:

  • control_sim_robot.py -> we won't do it in this PR
  • Handle MultiLeRobotDataset

How it was tested

This PR allows re-enabling a number of tests, including integration tests. Some datasets are still pulled from the hub by the tests, but far fewer, since we can now select a single episode thanks to datasets v2.

We will refactor the tests in a future PR to make them easier to write/maintain/scale. Notably, we can now remove most of tests/data (we'll still keep backward-compatibility test artifacts, but that's okay since they're lightweight).

How to checkout & try? (for the reviewer)

Try out the new version of examples/4_train_policy_with_script.md


@aliberts aliberts self-assigned this Dec 5, 2024
@aliberts aliberts added the 🔄 Refactor and 🔧 Config labels Dec 5, 2024
"takes precedence.",
)
# Use the checkpoint config instead of the provided config (but keep `resume` parameter).
self = checkpoint_cfg

A reviewer commented on this snippet:

pretty sure this doesn't do what you want?


@aliberts (Collaborator, Author) replied:

Indeed that doesn't work at all, I will handle that part once I'm done with the rest of the config (I'm on the policies right now, which is quite a big chunk).
Thanks for the heads up!

@aliberts aliberts force-pushed the user/aliberts/2024_11_30_remove_hydra branch from 87d92f9 to 06b604b on January 6, 2025 17:18
aliberts and others added 12 commits January 6, 2025 22:09
Co-authored-by: Simon Alibert <[email protected]>
aliberts and others added 13 commits January 28, 2025 09:25
Fix
This reverts commit aa65bb7.
@aliberts aliberts marked this pull request as ready for review January 29, 2025 15:24
Cadene and others added 3 commits January 29, 2025 20:47
…_30_remove_hydra
@tc-huang (Contributor) left a comment:

Hello @aliberts,
I noticed a few potential typos while reading examples/4_train_policy_with_script.md:

  • equiped → equipped (line 2)
  • dictionnaries → dictionaries (line 26)
  • exemple → example (line 45)

I hope this is helpful!

aliberts and others added 2 commits January 31, 2025 09:37
Co-authored-by: HUANG TZU-CHUN <[email protected]>
@Cadene (Collaborator) left a comment:

A thorough shallow review

aliberts and others added 4 commits January 31, 2025 12:05
Co-authored-by: Remi <[email protected]>
…ggingface/lerobot into user/aliberts/2024_11_30_remove_hydra
@aliberts aliberts merged commit 3c0a209 into main Jan 31, 2025
7 checks passed
@aliberts aliberts deleted the user/aliberts/2024_11_30_remove_hydra branch January 31, 2025 12:57
menhguin pushed a commit to menhguin/lerobot that referenced this pull request Feb 9, 2025
Co-authored-by: Remi <[email protected]>
Co-authored-by: HUANG TZU-CHUN <[email protected]>
JIy3AHKO pushed a commit to vertix/lerobot that referenced this pull request Feb 27, 2025
Co-authored-by: Remi <[email protected]>
Co-authored-by: HUANG TZU-CHUN <[email protected]>
Labels: 🔧 Config (Change / add / remove configuration), 🔄 Refactor (Refactoring)
4 participants