[Feature] New shader config system and refactors #499

Merged (22 commits) on Aug 19, 2024
8 changes: 4 additions & 4 deletions docs/source/contributing/tasks.md
@@ -68,7 +68,7 @@ class PushCube(BaseEnv):
@property
def _default_sim_config(self):
return SimConfig(
gpu_memory_cfg=GPUMemoryConfig(
gpu_memory_config=GPUMemoryConfig(
found_lost_pairs_capacity=2**25, max_rigid_patch_count=2**18
)
)
@@ -83,15 +83,15 @@ class RotateSingleObjectInHand(BaseEnv):
@property
def _default_sim_config(self):
return SimConfig(
gpu_memory_cfg=GPUMemoryConfig(
gpu_memory_config=GPUMemoryConfig(
max_rigid_contact_count=self.num_envs * max(1024, self.num_envs) * 8,
max_rigid_patch_count=self.num_envs * max(1024, self.num_envs) * 2,
found_lost_pairs_capacity=2**26,
)
)
```

For GPU simulation tuning, there are generally two considerations, memory and speed. It is recommended to set `gpu_memory_cfg` in such a way so that no errors are outputted when simulating as many as `4096` parallel environments with state observations on a single GPU.
For GPU simulation tuning, there are generally two considerations: memory and speed. It is recommended to set `gpu_memory_config` such that no errors are output when simulating as many as `4096` parallel environments with state observations on a single GPU.

A simple way to test is to run the GPU sim benchmarking script on your already registered environment and check if any errors are reported
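
If you prefer not to use the benchmarking script, a minimal sketch of an equivalent check is below; `MyTask-v1` is a placeholder for your registered environment id, and a CUDA GPU is assumed (GPU simulation is required for `num_envs > 1`):

```python
# Sketch: stress-test GPU memory settings by simulating 4096 state-based environments.
# "MyTask-v1" is a placeholder; substitute your registered environment id.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments

env = gym.make("MyTask-v1", num_envs=4096, obs_mode="state")
env.reset(seed=0)
for _ in range(100):
    # PhysX prints capacity errors here if the GPU memory config values are too small
    env.step(env.action_space.sample())
env.close()
```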

@@ -126,5 +126,5 @@ Examples of task cards are found throughout the [task documentation](../tasks/in
When contributing the task, make sure you do the following:

- The task code itself should have a reasonable unique name and be placed in `mani_skill/envs/tasks`.
- Added a demo video of the task being solved successfully (for each variation if there are several) to `figures/environment_demos`. The video should have ray-tracing on so it looks nicer! This can be done by replaying a trajectory with `shader_dir="rt"` passed into `gym.make` when making the environment.
- Added a demo video of the task being solved successfully (for each variation if there are several) to `figures/environment_demos`. The video should have ray-tracing on so it looks nicer! This can be done by replaying a trajectory with `human_render_camera_configs=dict(shader_pack="rt")` passed into `gym.make` when making the environment.
- Added a task card to `docs/source/tasks/index.md`.
2 changes: 1 addition & 1 deletion docs/source/user_guide/concepts/gpu_simulation.md
@@ -6,7 +6,7 @@ ManiSkill leverages [PhysX](https://github.com/NVIDIA-Omniverse/PhysX) to perfor

With GPU parallelization, the concept is that one can simulate a task thousands of times at once per GPU. In ManiSkill/SAPIEN this is realized by effectively putting all actors and articulations <span style="color:#F1A430">**into the same physx scene**</span> and giving each task its own small workspace in the physx scene known as a <span style="color:#0086E7">**sub-scene**</span>.

The idea of sub-scenes is that reading data of e.g. actor poses is automatically pre-processed to be relative to the center of the sub-scene and not the physx scene. The diagram below shows how 64 sub-scenes may be organized. Note that each sub-scene's distance to each other is defined by the simulation configuration `sim_cfg.spacing` value which can be set when building your own task.
The idea of sub-scenes is that data read for e.g. actor poses is automatically pre-processed to be relative to the center of the sub-scene rather than the physx scene. The diagram below shows how 64 sub-scenes may be organized. Note that the distance between sub-scenes is defined by the simulation configuration `sim_config.spacing` value, which can be set when building your own task.

:::{figure} images/physx_scene_subscene_relationship.png
:::
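
As a sketch of how the spacing would be set in a custom task (the `SimConfig` import path below is an assumption that may differ across versions, and the task class is purely illustrative):

```python
# Sketch: widen sub-scene spacing for a task whose objects may travel far from the sub-scene origin.
from mani_skill.envs.sapien_env import BaseEnv
from mani_skill.utils.structs.types import SimConfig  # import path assumed; verify for your version


class MyLargeWorkspaceTask(BaseEnv):  # hypothetical task for illustration
    @property
    def _default_sim_config(self):
        # 20 m between sub-scene origins so neighboring parallel environments cannot interact
        return SimConfig(spacing=20)
```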
17 changes: 9 additions & 8 deletions docs/source/user_guide/concepts/observation.md
@@ -1,22 +1,20 @@
# Observation

<!-- See our [colab tutorial](https://colab.research.google.com/github/haosulab/ManiSkill/blob/main/examples/tutorials/customize_environments.ipynb#scrollTo=NaSQ7CD2sswC) for how to customize cameras. -->

## Observation mode

**The observation mode defines the observation space.**
All ManiSkill tasks take the observation mode (`obs_mode`) as one of the input arguments of `__init__`.
In general, the observation is organized as a dictionary (with an observation space of `gym.spaces.Dict`).

There are two raw observations modes: `state_dict` (privileged states) and `sensor_data` (raw sensor data like visual data without postprocessing). `state` is a flat version of `state_dict`. `rgbd` and `pointcloud` apply post-processing on `sensor_data` to give convenient representations of visual data.
There are two raw observation modes: `state_dict` (privileged states) and `sensor_data` (raw sensor data like visual data without postprocessing). `state` is a flat version of `state_dict`. `rgb+depth`, `rgb+depth+segmentation` (or any combination of `rgb`, `depth`, `segmentation`), and `pointcloud` apply post-processing on `sensor_data` to give convenient representations of visual data.

The details here show the unbatched shapes. In general there is always a batch dimension unless you are using CPU simulation. Moreover, we annotate the dtypes of some values, where some have both a torch and numpy dtype depending on whether you are using GPU or CPU simulation respectively.
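
For reference, a minimal sketch of how an observation mode is selected at creation time (`PickCube-v1` is used purely as an example task):

```python
# Sketch: the obs_mode argument picks one of the observation modes described below.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments

env = gym.make("PickCube-v1", obs_mode="rgb+depth+segmentation")
obs, _ = env.reset(seed=0)
# obs["sensor_data"]["<camera_uid>"] should now contain "rgb", "depth" and "segmentation" images
```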

### state_dict

The observation is a dictionary of states. It usually contains privileged information such as object poses. It is not supported for soft-body tasks.

- `agent`: robot proprioception
- `agent`: robot proprioception (return value of a task's `_get_obs_agent` function)
- `qpos`: [nq], current joint positions. *nq* is the degree of freedom.
- `qvel`: [nq], current joint velocities
<!-- - `base_pose`: [7], robot position (xyz) and quaternion (wxyz) in the world frame -->
@@ -29,7 +27,7 @@ It is a flat version of *state_dict*. The observation space is `gym.spaces.Box`.

### sensor_data

In addition to `agent` and `extra`, `sensor_data` and `sensor_param` are introduced.
In addition to `agent` and `extra`, `sensor_data` and `sensor_param` are introduced. At the moment there are only Camera type sensors. Cameras are special in that they can be run with different choices of shaders. The default shader is called `minimal`, which is the fastest and most memory-efficient option. The shader chosen determines what data is stored in this observation mode. We describe the raw data format for the `minimal` shader here. Detailed information on how sensors/cameras can be customized can be found in the [sensors](../tutorials/sensors/index.md) section.

- `sensor_data`: data captured by sensors configured in the environment
- `{sensor_uid}`:
@@ -46,7 +44,7 @@ In addition to `agent` and `extra`, `sensor_data` and `sensor_param` are introdu
- `extrinsic_cv`: [4, 4], camera extrinsic (OpenCV convention)
- `intrinsic_cv`: [3, 3], camera intrinsic (OpenCV convention)

### rgbd
### rgb+depth+segmentation

This observation mode has the same data format as the [sensor_data mode](#sensor_data), but all sensor data from cameras are replaced with the following structure

@@ -58,9 +56,10 @@ This observation mode has the same data format as the [sensor_data mode](#sensor
- `depth`: [H, W, 1], `torch.int16, np.uint16`. The unit is millimeters. 0 stands for an invalid pixel (beyond the camera far).
- `segmentation`: [H, W, 1], `torch.int16, np.uint16`. See the [Segmentation data section](#segmentation-data) for more details.

Otherwise keep the same data without any additional processing as in the sensor_data mode
Note that this data is not scaled/normalized to [0, 1] or [-1, 1] in order to conserve memory, so if you plan to train on RGB, depth, and/or segmentation data, be sure to scale it before training on it.


Note that this data is not scaled/normalized to [0, 1] or [-1, 1] in order to conserve memory, so if you consider to train on RGBD data be sure to scale your data before training on it.
ManiSkill by default flexibly supports different combinations of RGB, depth, and segmentation data, namely `rgb`, `depth`, `segmentation`, `rgb+depth`, `rgb+depth+segmentation`, `rgb+segmentation`, and `depth+segmentation` (`rgbd` is shorthand for `rgb+depth`). Any image modality that is not chosen is not included in the observation, which conserves some memory and GPU bandwidth.
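
A minimal sketch of the kind of scaling referred to above (the exact target ranges depend on your model; this only assumes the raw dtypes documented here):

```python
# Sketch: scale raw image observations before training on them.
import torch


def preprocess(rgb: torch.Tensor, depth: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    rgb = rgb.float() / 255.0       # uint8 [0, 255] -> float [0, 1]
    depth = depth.float() / 1000.0  # uint16 millimeters -> float meters
    return rgb, depth
```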

When visualized, the RGB and depth data can look like the following:
```{image} images/replica_cad_rgbd.png
@@ -69,6 +68,8 @@ alt: RGBD from two cameras of Fetch robot inside the ReplicaCAD dataset scene
---
```



### pointcloud
This observation mode has the same data format as the [sensor_data mode](#sensor_data), but all sensor data from cameras are removed and a new key called `pointcloud` is added.

8 changes: 4 additions & 4 deletions docs/source/user_guide/getting_started/quickstart.md
@@ -86,7 +86,7 @@ For the full documentation of options you can provide for gym.make see the [docs

## GPU Parallelized/Vectorized Tasks

ManiSkill is powered by SAPIEN which supports GPU parallelized physics simulation and GPU parallelized rendering. This enables achieving 200,000+ state-based simulation FPS and 10,000+ FPS with rendering on a single 4090 GPU on a e.g. manipulation tasks. The FPS can be higher or lower depending on what is simulated. For full benchmarking results see [this page](../additional_resources/performance_benchmarking)
ManiSkill is powered by SAPIEN, which supports GPU parallelized physics simulation and GPU parallelized rendering. This enables achieving 200,000+ state-based simulation FPS and 30,000+ FPS with rendering on a single 4090 GPU on e.g. manipulation tasks. The FPS can be higher or lower depending on what is simulated. For full benchmarking results see [this page](../additional_resources/performance_benchmarking).

In order to run massively parallelized tasks on a GPU, it is as simple as adding the `num_envs` argument to `gym.make` as so
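
The code block itself is collapsed in this diff; a hedged sketch of the pattern it describes (a CUDA GPU is required for `num_envs > 1`):

```python
# Sketch: request GPU-parallelized simulation by passing num_envs to gym.make.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments

env = gym.make("PickCube-v1", num_envs=1024)
obs, _ = env.reset(seed=0)  # every value in obs now has a leading batch dimension of 1024
```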

@@ -137,7 +137,7 @@ which will look something like this

### Parallel Rendering in one Scene

We further support via recording or GUI to view all parallel environments at once, and you can also turn on ray-tracing for more photo-realism. Note that this feature is not useful for any practical purposes (for e.g. machine learning) apart from generating cool demonstration videos and so it is not well optimized.
We further support viewing all parallel environments at once via recording or the GUI, and you can also turn on ray-tracing for more photo-realism. Note that this feature is not useful for any practical purposes (e.g. machine learning) apart from generating cool demonstration videos.

Turning the parallel GUI render on simply requires adding the argument `parallel_in_single_scene` to `gym.make` as so

@@ -151,7 +151,7 @@ env = gym.make(
control_mode="pd_joint_delta_pos",
num_envs=16,
parallel_in_single_scene=True,
shader_dir="rt-fast" # optionally set this argument for more photo-realistic rendering
viewer_camera_configs=dict(shader_pack="rt-fast"),
)
```

@@ -170,7 +170,7 @@ We currently do not properly support exposing multiple visible CUDA devices to a

Each ManiSkill task supports different **observation modes** and **control modes**, which determine its **observation space** and **action space**. They can be specified by `gym.make(env_id, obs_mode=..., control_mode=...)`.

The common observation modes are `state`, `rgbd`, `pointcloud`. We also support `state_dict` (states organized as a hierarchical dictionary) and `sensor_data` (raw visual observations without postprocessing). Please refer to [Observation](../concepts/observation.md) for more details.
The common observation modes are `state`, `rgbd`, `pointcloud`. We also support `state_dict` (states organized as a hierarchical dictionary) and `sensor_data` (raw visual observations without postprocessing). Please refer to [Observation](../concepts/observation.md) for more details. Furthermore, visual data generated by the simulator can be modified in many ways via shaders. Please refer to [the sensors/cameras tutorial](../tutorials/sensors/index.md) for more details.
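
A concrete sketch of that call (the specific modes are illustrative; check each task's supported observation and control modes):

```python
# Sketch: choose observation and control modes at creation time.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments

env = gym.make("PickCube-v1", obs_mode="rgbd", control_mode="pd_joint_delta_pos")
```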

We support a wide range of controllers. Different controllers can have different effects on your algorithms. Thus, it is recommended to understand the action space you are going to use. Please refer to [Controllers](../concepts/controllers.md) for more details.

1 change: 1 addition & 0 deletions docs/source/user_guide/index.md
@@ -43,6 +43,7 @@ datasets/index
data_collection/index
reinforcement_learning/index
learning_from_demos/index
wrappers/index
```

```{toctree}
16 changes: 16 additions & 0 deletions docs/source/user_guide/reinforcement_learning/setup.md
@@ -1,5 +1,10 @@
# Setup

This page documents key things to know when setting up ManiSkill environments for reinforcement learning, including:

- How to convert ManiSkill environments to gymnasium API compatible environments, both [single](#gym-environment-api) and [vectorized](#gym-vectorized-environment-api) APIs.
- [Useful Wrappers](#useful-wrappers)

ManiSkill environments are created by gymnasium's `make` function. The result is by default a "batched" environment where every input and output is batched. Note that this is not the standard gymnasium API. If you want the standard gymnasium environment / vectorized environment API, see the next sections.
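
A minimal sketch of this default batched behavior (assuming a state observation mode and a CUDA GPU, which is required for `num_envs > 1`):

```python
# Sketch: even a "single" ManiSkill environment object returns batched torch tensors by default.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments

env = gym.make("PickCube-v1", num_envs=4, obs_mode="state")
obs, _ = env.reset(seed=0)
print(obs.shape)  # (4, state_dim): note the leading batch dimension
```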

@@ -56,3 +61,14 @@ You may also notice that there are two additional options when creating a vector

Note that for efficiency, everything returned by the environment will be a batched torch tensor on the GPU and not a batched numpy array on the CPU. This is the only difference you may need to account for between ManiSkill vectorized environments and gymnasium vectorized environments.
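
If parts of your tooling expect numpy, a hedged sketch of the explicit conversion (this copies data from GPU to CPU, so avoid it in hot loops):

```python
# Sketch: outputs are torch tensors on the GPU; convert only where numpy is truly required.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments

env = gym.make("PickCube-v1", num_envs=8, obs_mode="state")
env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
reward_np = reward.cpu().numpy()  # explicit GPU -> CPU copy
```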

## Useful Wrappers

RL practitioners often use wrappers to modify and augment environments. These are documented in the [wrappers](../wrappers/index.md) section. Some commonly used ones include:
- [RecordEpisode](../wrappers/record.md) for recording videos/trajectories of rollouts.
- [FlattenRGBDObservations](../wrappers/flatten.md#flatten-rgbd-observations) for flattening the `obs_mode="rgbd"` or `obs_mode="rgb+depth"` observations into a simple dictionary with just a combined `rgbd` tensor and `state` tensor.
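
A hedged sketch of wrapping an environment with `RecordEpisode` (the import path and arguments below are assumptions; check the wrappers documentation for your version):

```python
# Sketch: record rollout videos with the RecordEpisode wrapper.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments
from mani_skill.utils.wrappers import RecordEpisode  # import path assumed

env = gym.make("PickCube-v1", obs_mode="state", render_mode="rgb_array")
env = RecordEpisode(env, output_dir="videos", save_video=True)  # argument names assumed
env.reset(seed=0)
for _ in range(50):
    env.step(env.action_space.sample())
env.close()  # flushes the recorded video to output_dir
```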

## Common Mistakes / Gotchas

In older environments/benchmarks, people have often used `env.render(mode="rgb_array")` or `env.render()` to get image inputs for RL agents. This is not correct in ManiSkill: image observations are returned directly by `env.reset()` and `env.step()`, and `env.render` is only for visualization/video recording.

For robotics tasks, observations are often composed of state information (like robot joint angles) and image observations (like camera images). All tasks in ManiSkill will specifically remove certain privileged state information, such as ground-truth object poses, from the observations when the `obs_mode` is not `state` or `state_dict`. Moreover, the image observations returned by `env.reset()` and `env.step()` are usually from cameras that are positioned in specific locations to provide a good view of the task and make it solvable.
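
A short sketch contrasting the two (the sensor uid is looked up dynamically since camera names depend on the task):

```python
# Sketch: image observations come from reset()/step(); render() is only for visualization/recording.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments

env = gym.make("PickCube-v1", obs_mode="rgb", render_mode="rgb_array")
obs, _ = env.reset(seed=0)
camera_uid = next(iter(obs["sensor_data"]))              # e.g. the task's base camera
rgb_for_policy = obs["sensor_data"][camera_uid]["rgb"]   # feed this (scaled) to your agent
frame_for_video = env.render()                           # for videos/visualization only
```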
12 changes: 6 additions & 6 deletions docs/source/user_guide/tutorials/custom_tasks/advanced.md
@@ -168,7 +168,7 @@ In the drop down below is a copy of all the configurations possible
:::{dropdown} All sim configs
:icon: code

```
```python
@dataclass
class GPUMemoryConfig:
"""A gpu memory configuration dataclass that neatly holds all parameters that configure physx GPU memory for simulation"""
@@ -232,16 +232,16 @@ class DefaultMaterialsConfig:

@dataclass
class SimConfig:
spacing: int = 5
spacing: float = 5
"""Controls the spacing between parallel environments when simulating on GPU in meters. Increase this value
if you expect objects in one parallel environment to impact objects within this spacing distance"""
sim_freq: int = 100
"""simulation frequency (Hz)"""
control_freq: int = 20
"""control frequency (Hz). Every control step (e.g. env.step) contains sim_freq / control_freq physx simulation steps"""
gpu_memory_cfg: GPUMemoryConfig = field(default_factory=GPUMemoryConfig)
scene_cfg: SceneConfig = field(default_factory=SceneConfig)
default_materials_cfg: DefaultMaterialsConfig = field(
gpu_memory_config: GPUMemoryConfig = field(default_factory=GPUMemoryConfig)
scene_config: SceneConfig = field(default_factory=SceneConfig)
default_materials_config: DefaultMaterialsConfig = field(
default_factory=DefaultMaterialsConfig
)

Expand All @@ -259,7 +259,7 @@ class MyCustomTask(BaseEnv):
@property
def _default_sim_config(self):
return SimConfig(
gpu_memory_cfg=GPUMemoryConfig(
gpu_memory_config=GPUMemoryConfig(
max_rigid_contact_count=self.num_envs * max(1024, self.num_envs) * 8,
max_rigid_patch_count=self.num_envs * max(1024, self.num_envs) * 2,
found_lost_pairs_capacity=2**26,
1 change: 1 addition & 0 deletions docs/source/user_guide/tutorials/index.md
@@ -10,6 +10,7 @@ For those looking for a quickstart/tutorial on Google Colab, checkout the [quick

custom_tasks/index
custom_robots
sensors/index
custom_reusable_scenes
domain_randomization
```
37 changes: 37 additions & 0 deletions docs/source/user_guide/tutorials/sensors/index.md
@@ -0,0 +1,37 @@
# Sensors / Cameras

This page documents in depth how to use and customize sensors and cameras in ManiSkill, both at runtime and in task/environment definitions. In ManiSkill, sensors are "devices" that can capture some modality of data. At the moment the Camera is the only sensor type.

## Cameras

Cameras in ManiSkill can capture many different modalities of data. By default ManiSkill limits these to just `rgb`, `depth`, `position` (which is used to derive depth), and `segmentation`. Internally ManiSkill uses [SAPIEN](https://sapien.ucsd.edu/), which has a highly optimized rendering system that leverages shaders to render different modalities of data.

Each shader has a preset configuration that generates textures containing data in an image format, often one that is somewhat difficult to use due to heavy optimization. ManiSkill uses a shader configuration system in Python that parses these different shaders into more user-friendly formats (namely the well-known `rgb`, `depth`, `position`, and `segmentation` type data). This shader config system resides in this file on [Github](https://github.com/haosulab/ManiSkill/blob/main/mani_skill/render/shaders.py) and defines a few friendly defaults for minimal/fast rendering and ray-tracing.


Every ManiSkill environment has 3 categories of cameras (although some categories can be empty): sensors, which provide observations for agents/policies; human_render_cameras, used for (high-quality) video capture for humans; and a single viewer camera, which is used by the GUI application to render the environment.


At runtime, when creating environments with `gym.make`, you can pass overrides to any of these cameras as shown below. The example changes the human render cameras to use the ray-tracing shader for photorealistic rendering, modifies the sensor cameras to have width 320 and height 240, and changes the viewer camera to have a different field of view.

```python
gym.make("PickCube-v1",
sensor_configs=dict(width=320, height=240),
human_render_camera_configs=dict(shader_pack="rt"),
viewer_camera_configs=dict(fov=1),
)
```

These overrides will affect every camera in the environment in that group. So `sensor_configs=dict(width=320, height=240)` will change the width and height of every sensor camera in the environment, but will not affect the human render cameras or the viewer camera.

To override specific cameras, you can do so by camera name. For example, if you want to override the sensor camera named `camera_0` to have a different width and height, you can do so as follows:

```python
gym.make("PickCube-v1",
sensor_configs=dict(camera_0=dict(width=320, height=240)),
)
```

Now all other sensor cameras will have the default width and height, and `camera_0` will have the specified width and height.

These per-camera customizations can be useful for those looking to tailor how they render or generate policy observations to suit their needs.
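
As a quick sanity check, here is a sketch using the group-level override so it does not depend on knowing a task's exact camera names:

```python
# Sketch: confirm a camera resolution override by inspecting the returned observation shapes.
import gymnasium as gym

import mani_skill.envs  # registers ManiSkill environments

env = gym.make("PickCube-v1", obs_mode="rgb", sensor_configs=dict(width=320, height=240))
obs, _ = env.reset(seed=0)
print({uid: data["rgb"].shape for uid, data in obs["sensor_data"].items()})
# each rgb image should end in (240, 320, 3); a leading batch dimension appears with GPU simulation
```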