polish(xjx): 1.0 (opendilab#143)
* polish(xjx): change doc structure, add intro

* Replace head image

* polish(xjx): system, basic, middleware (opendilab#117)

* Add middleware

* Add system design

* Add middleware spec

* Change search background

* Add quick start

* Use pytorch theme

* Adjusting grammar and errors

* New logo

* doc(hansbug): add guides for unittest, visualization and code style. (opendilab#127)

* doc(hansbug): add 3 new pages

* doc(hansbug): add code style page

* dev(hansbug): add plantuml's documentation

* dev(hansbug): add note

* dev(hansbug): align the image to center

* dev(hansbug): add graphviz's documentation

* dev(hansbug): add documentation for draw.io

* dev(hansbug): fix problem on draw.io

* dev(hansbug): add introduction for snakeviz

* dev(hansbug): add introduction for snakeviz

* dev(hansbug): add former parts of unittest

* fix(hansbug): do some fix

* dev(hansbug): add writing guide for unittest

* fix(hansbug): fix bug of

* dev(hansbug): add running guide for unittest

* fix(hansbug): fix the last code block

* dev(hansbug): append features to visualization

* dev(hansbug): add code style guide

* dev(hansbug): move the docs to new path

* fix(hansbug): fix the problems in chinese pages

* fix(hansbug): use english tutorials

* feature(pu): add config_spec_zh, basic_rl_zh, exploration_rl_zh, imitation_learning_zh (opendilab#126)

* feature(pu): add config_spec_zh

* feature(pu): add config_spec_zh, basic_rl_zh, exploration_rl_zh, imitation_learning_zh

* polish(pu):polish index

* polish(pu): polish style

* polish(zlx): 24-cooperation (opendilab#122)

* polish(zlx): Init 24-cooperation(git + issue/pr)

* polish(zlx): Add git_guide and issue_pr

* fix(zlx): fix comments by xjx

* feature(zlx): Add en version of 24-cooperation

* polish(zlx): fix comments by xjx

* polish(zjow): polish and revise quickstart and installation. (opendilab#121)

* Polish Quickstart.

* Minor change.

* Minor change.

* polish(zlx): 13. envs (opendilab#118)

* polish(zlx): Move env to new place. Polish index and images

* polish(zlx): Modify image scale

* polish(zlx): Add space in zh version

* polish(zlx): Move env to new place. Polish index and images

* polish(zlx): Modify image scale

* fixbug(zlx): en index indent

* polish(zlx): polish format via make live

* polish(zlx): fix comments by xjx

Co-authored-by: zhaoliangxuan <[email protected]>

* doc(nyz): add distributed rl overview (opendilab#133)

* doc(nyz): add distributed rl overview

* polish(nyz): polish footnote and note

* doc(davide): transfer 12 policies (opendilab#120)

* filled index en and ch

* Update index.rst

* added dqn_zh in index

* doc(zms): 11_dizoo: add zh + en version of index (opendilab#130)

* 1st zh doc

* change

* change links

* add note

* draft version of en dizoo

* change a bit

* final version

* Update index.rst

* polish(nyz): add missing images and polish doc

* doc(lxl): 02_algo: add offline rl zh (opendilab#125)

* polish(lxl): fix grammar and typos

* resolve conflicts when changing branches

* doc(lxl): add 02_algo/offline_rl_zh draft

* add offline rl doc

* polish offline rl doc

* polish offline rl

* polish offline rl: reformat reference

* polish offline rl: fix typo

* doc(jrn): add 02_algo model_based_rl_zh (opendilab#128)

* doc(jrn): add 02_algo mbrl

* doc(jrn): add 02_algo mbrl

* doc(jrn): modify 02_algo mbrl

* doc(jrn): modify 02_algo mbrl

* doc(jrn): polish 02_algo mbrl zh

* modify(jrn): polish source/02_algo/model_based_rl_zh.rst

* polish model_based_rl_zh.rst again

* polish model_based_rl_zh.rst again

* doc(jrn): add 02_algo mbrl

* doc(jrn): add 02_algo mbrl

* doc(jrn): modify 02_algo mbrl

* doc(jrn): modify 02_algo mbrl

* doc(jrn): polish 02_algo mbrl zh

* modify(jrn): polish source/02_algo/model_based_rl_zh.rst

* polish model_based_rl_zh.rst again

* polish(zlx): 24-cooperation (opendilab#122)

* polish(zlx): Init 24-cooperation(git + issue/pr)

* polish(zlx): Add git_guide and issue_pr

* fix(zlx): fix comments by xjx

* feature(zlx): Add en version of 24-cooperation

* polish(zlx): fix comments by xjx

* polish model_based_rl_zh.rst again

* polish(zjow): polish and revise quickstart and installation. (opendilab#121)

* Polish Quickstart.

* Minor change.

* Minor change.

* polish(nyz): add offline rl and gtrxl images

* doc(pu):  add exploration overview and footnote for exploration_rl_zh (opendilab#134)

* feature(pu): add config_spec_zh

* feature(pu): add config_spec_zh, basic_rl_zh, exploration_rl_zh, imitation_learning_zh

* polish(pu):polish index

* polish(pu): polish style

* polish(pu): polish style

* polish(pu): add exploration overview and footnote

* fix(pu): fix wrongly changed file

* polish(pu): add information-theory-based exploration part

* translate(zlx): Integrate past translation PRs (opendilab#135)

* translate(nyp): diayn zh

* doc(py): ngu zh

* translate(gh): smac zh

* translate(gh): icm zh

* translate(cy): cartpole & gym-hybrid en

* translate(xzy): minigrid & pendulum en

* translate(zyc&yf): bipedalwalker & lunarlander en

* translate(hs): mujoco & procgen en

* translate(hs): r2d3 en

* polish(zlx): remove .. _ in 13_envs

* polish(zlx): polish format

* doc(wyh): algo02 MARL docs (opendilab#129)

* doc(wyh):marl

* doc(wyh):marl polish

* translation(lxl): add offline_rl_en & polish offline_rl_zh (opendilab#136)

* fix(nyz): fix offline rl author typo

* polish(zlx): polish mujoco & r2d3 by hs, which are ignored before (opendilab#138)

* doc(zms): add comments to "framework/middleware" (opendilab#137)

* add the refs to comments of "framework/middleware"

* change maxdepth of framework/index.rst from 4 to 2

* polish(lxl): polish offline_rl_zh, fix typos and grammar (opendilab#139)

* add offline_rl_en

* reorganize the description of Future & Outlooks

* polish

* doc(nyp): add best practice zh for our doc 1.0 (opendilab#132)

* doc(wzl): add pettingzoo.zh doc (opendilab#124)

* add pettingzoo_zh.rst

* update pettingzoo_zh.rst

* fix(hs):fix install atari_env error (opendilab#116)

* fix install atari_env error

* Update atari.rst

* Update atari_zh.rst

* add best practice zh for doc 1.0

* add rnn translation

* finish rnn; fix some translations (wrappers)

* modify regarding the comments

* change wrt comment

* modify unroll_len/sequence_len key

* fix multi-discrete action space

Co-authored-by: zerlinwang <[email protected]>
Co-authored-by: norman <[email protected]>
Co-authored-by: nieyunpeng <[email protected]>

* Cleanup old resources

* Space

* Fix offline rl

Co-authored-by: Hankson Bradley <[email protected]>
Co-authored-by: 蒲源 <[email protected]>
Co-authored-by: LuciusMos <[email protected]>
Co-authored-by: zjowowen <[email protected]>
Co-authored-by: zhaoliangxuan <[email protected]>
Co-authored-by: Swain <[email protected]>
Co-authored-by: Davide Liu <[email protected]>
Co-authored-by: zms <[email protected]>
Co-authored-by: lixl-st <[email protected]>
Co-authored-by: Jia Ruonan <[email protected]>
Co-authored-by: Weiyuhong-1998 <[email protected]>
Co-authored-by: Will-Nie <[email protected]>
Co-authored-by: zerlinwang <[email protected]>
Co-authored-by: norman <[email protected]>
Co-authored-by: nieyunpeng <[email protected]>
16 people authored May 30, 2022
1 parent 6a7becd commit df02585
Showing 775 changed files with 9,970 additions and 13,178 deletions.
9 changes: 5 additions & 4 deletions .gitignore
@@ -1,10 +1,11 @@
*.eps
*.jpg
*.svg
*.puml.eps
*.puml.jpg
*.puml.svg
.DS_Store
build/
source/_build
_build/
.vscode/
venv/
.idea/
.idea/
src/pytorch-sphinx-theme/
3 changes: 2 additions & 1 deletion requirements.txt
@@ -1,8 +1,9 @@
Pillow==8.2.0
sphinx>=2.2.1,<=4.2
sphinx_rtd_theme~=0.4.3
sphinx_rtd_theme
enum_tools
sphinx-toolbox
plantumlcli>=0.0.2
sphinx-autobuild
git+http://github.com/opendilab/DI-engine@main
-e git+https://github.com/opendilab/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
37 changes: 37 additions & 0 deletions source/00_intro/index.rst
@@ -0,0 +1,37 @@
Introduction
===============================

What is DI-engine?
-------------------------------

DI-engine is a decision intelligence platform built by a group of enthusiastic researchers and engineers. \
It provides professional and convenient support for your reinforcement learning algorithm research and development, \
mainly including:

1. Comprehensive algorithm support, such as DQN, PPO, and SAC, as well as many algorithms for research subfields - \
QMIX for multi-agent reinforcement learning, GAIL for inverse reinforcement learning, RND for exploration problems, etc.

2. A user-friendly interface: we abstract the most common objects in reinforcement learning tasks, such as environments and policies, \
and encapsulate complex reinforcement learning processes into middleware, so you can build your own learning pipeline as you wish.

3. Flexible scalability: using the messaging components and event-programming interfaces integrated in the framework, \
you can scale your research work to industrial-grade large-scale training clusters, \
as demonstrated by the StarCraft II agent `DI-star <https://github.com/opendilab/DI-star>`_.

.. image:: ../images/system_layer.png

Key Concepts
-------------------------------

If you are not familiar with reinforcement learning, you can go to our `reinforcement learning tutorial <../10_concepts/index_zh.html>`_ \
for a glimpse into the wonderful world of reinforcement learning.

If you have already been exposed to reinforcement learning, you will be familiar with its basic interacting objects: \
**environments** and **agents (or the policies that constitute them)**.

Instead of creating more concepts, DI-engine abstracts the complex interaction logic between the two into declarative middleware, \
such as **collect**, **train**, **evaluate**, and **save_ckpt**. You can adapt each part of the process in the most natural way.
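
To make this concrete, here is a minimal sketch of how such middleware compose into a training loop \
(the names mirror the quickstart example later in this commit; ``cfg``, ``policy``, the environment managers \
and ``buffer_`` are assumed to have been created beforehand, so treat this as an illustration rather than a complete program):

.. code-block:: python

    from ding.framework import task
    from ding.framework.context import OnlineRLContext
    from ding.framework.middleware import OffPolicyLearner, StepCollector, interaction_evaluator, data_pusher, CkptSaver

    # Each middleware is a declarative step; task.use chains them into one loop.
    with task.start(async_mode=False, ctx=OnlineRLContext()):
        task.use(interaction_evaluator(cfg, policy.eval_mode, evaluator_env))  # evaluate
        task.use(StepCollector(cfg, policy.collect_mode, collector_env))       # collect
        task.use(data_pusher(cfg, buffer_))                                    # move collected data into the buffer
        task.use(OffPolicyLearner(cfg, policy.learn_mode, buffer_))            # train
        task.use(CkptSaver(cfg, policy, train_freq=100))                       # save_ckpt
        task.run()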

Using DI-engine is very easy. In the `quickstart <../01_quickstart/index_zh.html>`_, \
we will show you how to quickly build a classic reinforcement learning pipeline with a simple example.
28 changes: 28 additions & 0 deletions source/00_intro/index_zh.rst
@@ -0,0 +1,28 @@
Introduction to DI-engine
===============================

What is DI-engine?
-------------------------------

DI-engine is a decision intelligence platform built by a group of enthusiastic researchers and engineers. It provides the most professional and convenient support for your reinforcement learning algorithm research and development, mainly including:

1. Comprehensive algorithm support, such as DQN, PPO, and SAC, as well as many algorithms for research subfields - QMIX for multi-agent reinforcement learning, GAIL for inverse reinforcement learning, RND for exploration problems, etc.

2. A user-friendly interface: we abstract the most common objects in reinforcement learning tasks, such as environments and policies, and encapsulate complex reinforcement learning processes into rich middleware, so you can build your own learning pipeline as you wish.

3. Flexible scalability: using the messaging components and event-programming interfaces integrated in the framework, you can scale your research work to industrial-grade large-scale training clusters, such as the StarCraft II agent `DI-star <https://github.com/opendilab/DI-star>`_.

.. image:: ../images/system_layer.png

Key Concepts
-------------------------------

If you are not yet familiar with reinforcement learning, you can go to our `reinforcement learning tutorial <../10_concepts/index_zh.html>`_ for a glimpse into the wonderful world of reinforcement learning.

If you have already been exposed to reinforcement learning, you will be familiar with its basic interacting objects: **environments** and **agents (or the policies that constitute them)**.

Instead of creating more concepts, DI-engine abstracts the complex interaction logic between the two into declarative middleware, such as **collect**, **train**, **evaluate**, and **save_ckpt**, so you can adjust each part of the process in the most natural way.

Using DI-engine is very easy. In the `quickstart <../01_quickstart/index_zh.html>`_ section, we will show you how to quickly build a classic reinforcement learning pipeline with a simple example.
109 changes: 109 additions & 0 deletions source/01_quickstart/first_rl_program.rst
@@ -0,0 +1,109 @@
First Reinforcement Learning Program
======================================

.. toctree::
    :maxdepth: 2

CartPole is an ideal environment for getting started with reinforcement learning, \
and the DQN algorithm allows CartPole to converge (maintain equilibrium) in a very short time. \
We will introduce the usage of DI-engine based on CartPole + DQN.

.. image:: images/cartpole_cmp.gif
    :width: 1000
    :align: center

Using the Configuration File
------------------------------

DI-engine uses a global configuration file to control all variables of the environment and policy. \
Each of them has a corresponding default configuration, which can be found in \
`cartpole_dqn_config <https://github.com/opendilab/DI-engine/blob/main/dizoo/classic_control/cartpole/config/cartpole_dqn_config.py>`_. \
In this tutorial we use the default configuration directly:

.. code-block:: python

    from dizoo.classic_control.cartpole.config.cartpole_dqn_config import main_config, create_config
    from ding.config import compile_config

    cfg = compile_config(main_config, create_cfg=create_config, auto=True)

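If you want to tweak a default value, one option is to modify ``main_config`` before compiling it. \
A minimal sketch, assuming you only want to change the number of parallel collector environments \
(``env.collector_env_num`` is assumed to exist in the default config, as suggested by its use below):

.. code-block:: python

    from dizoo.classic_control.cartpole.config.cartpole_dqn_config import main_config, create_config
    from ding.config import compile_config

    # Override a default field before compilation; compile_config merges defaults and user settings.
    main_config.env.collector_env_num = 8
    cfg = compile_config(main_config, create_cfg=create_config, auto=True)
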
Initialize the Environments
------------------------------

In reinforcement learning, the way environment data is collected may differ between training and evaluation. \
For example, training usually runs one training iteration after every n collected steps, \
while evaluation requires completing whole episodes to obtain a score. \
We therefore recommend initializing the collection and evaluation environments separately, as follows:

.. code-block:: python

    import gym

    from ding.envs import DingEnvWrapper, BaseEnvManagerV2

    collector_env = BaseEnvManagerV2(
        env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.collector_env_num)],
        cfg=cfg.env.manager
    )
    evaluator_env = BaseEnvManagerV2(
        env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.evaluator_env_num)],
        cfg=cfg.env.manager
    )

.. note::

    DingEnvWrapper is DI-engine's unified wrapper for different environment libraries. \
    BaseEnvManagerV2 is the unified external interface for managing multiple environments, \
    so you can use BaseEnvManagerV2 to collect data from multiple environments in parallel.

Select Policy
--------------

DI-engine covers most reinforcement learning policies; using them only requires selecting the right policy and model.
Since DQN is an off-policy algorithm, we also need to instantiate a buffer module.

.. code-block:: python

    from ding.model import DQN
    from ding.policy import DQNPolicy
    from ding.data import DequeBuffer

    model = DQN(**cfg.policy.model)
    buffer_ = DequeBuffer(size=cfg.policy.other.replay_buffer.replay_buffer_size)
    policy = DQNPolicy(cfg.policy, model=model)

Build the Pipeline
---------------------

With the various middleware provided by DI-engine, we can easily build the entire pipeline:

.. code-block:: python

    from ding.framework import task
    from ding.framework.context import OnlineRLContext
    from ding.framework.middleware import OffPolicyLearner, StepCollector, interaction_evaluator, data_pusher, eps_greedy_handler, CkptSaver

    with task.start(async_mode=False, ctx=OnlineRLContext()):
        # Evaluation is placed first so that the score of the random model is recorded as a baseline
        task.use(interaction_evaluator(cfg, policy.eval_mode, evaluator_env))
        task.use(eps_greedy_handler(cfg))  # Decay the explore-exploit probability
        task.use(StepCollector(cfg, policy.collect_mode, collector_env))  # Collect environment data
        task.use(data_pusher(cfg, buffer_))  # Push data to the buffer
        task.use(OffPolicyLearner(cfg, policy.learn_mode, buffer_))  # Train the model
        task.use(CkptSaver(cfg, policy, train_freq=100))  # Save the model
        # If evaluation shows the model has exceeded the convergence score, the run ends early here
        task.run()

Run the Code
--------------

The full example can be found in `DQN example <https://github.com/opendilab/DI-engine/blob/main/ding/example/dqn.py>`_ and can be run via ``python dqn.py``.
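
If you prefer to assemble the snippets above into a single script yourself rather than use the repository example, \
a minimal end-to-end sketch could look roughly as follows (the ``main`` wrapper is only an illustrative convention; \
every import and call is taken from the steps shown earlier on this page):

.. code-block:: python

    import gym

    from dizoo.classic_control.cartpole.config.cartpole_dqn_config import main_config, create_config
    from ding.config import compile_config
    from ding.envs import DingEnvWrapper, BaseEnvManagerV2
    from ding.model import DQN
    from ding.policy import DQNPolicy
    from ding.data import DequeBuffer
    from ding.framework import task
    from ding.framework.context import OnlineRLContext
    from ding.framework.middleware import OffPolicyLearner, StepCollector, interaction_evaluator, data_pusher, eps_greedy_handler, CkptSaver


    def main():
        # Compile the default CartPole + DQN configuration.
        cfg = compile_config(main_config, create_cfg=create_config, auto=True)

        # Separate environment managers for collection and evaluation.
        collector_env = BaseEnvManagerV2(
            env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.collector_env_num)],
            cfg=cfg.env.manager
        )
        evaluator_env = BaseEnvManagerV2(
            env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.evaluator_env_num)],
            cfg=cfg.env.manager
        )

        # DQN is off-policy, so a replay buffer is required.
        model = DQN(**cfg.policy.model)
        buffer_ = DequeBuffer(size=cfg.policy.other.replay_buffer.replay_buffer_size)
        policy = DQNPolicy(cfg.policy, model=model)

        # Chain the middleware into one training loop.
        with task.start(async_mode=False, ctx=OnlineRLContext()):
            task.use(interaction_evaluator(cfg, policy.eval_mode, evaluator_env))
            task.use(eps_greedy_handler(cfg))
            task.use(StepCollector(cfg, policy.collect_mode, collector_env))
            task.use(data_pusher(cfg, buffer_))
            task.use(OffPolicyLearner(cfg, policy.learn_mode, buffer_))
            task.use(CkptSaver(cfg, policy, train_freq=100))
            task.run()


    if __name__ == "__main__":
        main()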

.. image:: images/train_dqn.gif
    :width: 1000
    :align: center

You have now completed your first reinforcement learning task with DI-engine. You can try out more algorithms \
in the `Examples directory <https://github.com/opendilab/DI-engine/blob/main/ding/example>`_, or continue reading \
the documentation to get a deeper understanding of DI-engine's `Algorithms <../02_algo/index.html>`_, `System Design <../03_system/index.html>`_ \
and `Best Practices <../04_best_practice/index.html>`_.
100 changes: 100 additions & 0 deletions source/01_quickstart/first_rl_program_zh.rst
@@ -0,0 +1,100 @@
First Reinforcement Learning Program
======================================

.. toctree::
    :maxdepth: 2

CartPole is an ideal environment for getting started with reinforcement learning, and the DQN algorithm allows CartPole to converge (maintain equilibrium) in a very short time.
We will introduce the usage of DI-engine based on CartPole + DQN.

.. image:: images/cartpole_cmp.gif
    :width: 1000
    :align: center

Using the Configuration File
------------------------------

DI-engine uses a global configuration file to control all variables of the environment and policy. Each environment and policy has a corresponding default configuration; the full configuration can be found in
`cartpole_dqn_config <https://github.com/opendilab/DI-engine/blob/main/dizoo/classic_control/cartpole/config/cartpole_dqn_config.py>`_.
In this tutorial we use the default configuration directly:

.. code-block:: python

    from dizoo.classic_control.cartpole.config.cartpole_dqn_config import main_config, create_config
    from ding.config import compile_config

    cfg = compile_config(main_config, create_cfg=create_config, auto=True)

Initialize the Collection and Evaluation Environments
---------------------------------------------------------

In reinforcement learning, the way environment data is collected may differ between training and evaluation: for example, training usually runs one training iteration after every n collected steps,
while evaluation requires completing whole episodes to obtain a score. We recommend initializing the collection and evaluation environments separately:

.. code-block:: python

    import gym

    from ding.envs import DingEnvWrapper, BaseEnvManagerV2

    collector_env = BaseEnvManagerV2(
        env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.collector_env_num)],
        cfg=cfg.env.manager
    )
    evaluator_env = BaseEnvManagerV2(
        env_fn=[lambda: DingEnvWrapper(gym.make("CartPole-v0")) for _ in range(cfg.env.evaluator_env_num)],
        cfg=cfg.env.manager
    )

.. note::

    DingEnvWrapper is DI-engine's unified wrapper for different environment libraries. BaseEnvManagerV2 is the unified external interface for managing multiple environments,
    and it allows data to be collected from multiple environments in parallel.

Select Policy
--------------

DI-engine covers most reinforcement learning policies; using them only requires selecting the right policy and model.
Since DQN is an off-policy algorithm, we also need to instantiate a buffer module.

.. code-block:: python

    from ding.model import DQN
    from ding.policy import DQNPolicy
    from ding.data import DequeBuffer

    model = DQN(**cfg.policy.model)
    buffer_ = DequeBuffer(size=cfg.policy.other.replay_buffer.replay_buffer_size)
    policy = DQNPolicy(cfg.policy, model=model)

Build the Training Pipeline
-----------------------------

With the various middleware provided by DI-engine, we can easily build the entire training pipeline:

.. code-block:: python

    from ding.framework import task
    from ding.framework.context import OnlineRLContext
    from ding.framework.middleware import OffPolicyLearner, StepCollector, interaction_evaluator, data_pusher, eps_greedy_handler, CkptSaver

    with task.start(async_mode=False, ctx=OnlineRLContext()):
        task.use(interaction_evaluator(cfg, policy.eval_mode, evaluator_env))  # Evaluation is placed first so that the score of the random model is recorded as a baseline
        task.use(eps_greedy_handler(cfg))  # Decay the explore-exploit probability
        task.use(StepCollector(cfg, policy.collect_mode, collector_env))  # Collect environment data
        task.use(data_pusher(cfg, buffer_))  # Push data to the buffer
        task.use(OffPolicyLearner(cfg, policy.learn_mode, buffer_))  # Train the model
        task.use(CkptSaver(cfg, policy, train_freq=100))  # Save the model
        task.run()  # If evaluation shows the model has exceeded the convergence score, the run ends early here

Run the Code
--------------

The full example can be found in `DQN example <https://github.com/opendilab/DI-engine/blob/main/ding/example/dqn.py>`_ and can be run via ``python dqn.py``.

.. image:: images/train_dqn.gif
    :width: 1000
    :align: center

You have now completed your first reinforcement learning task with DI-engine. You can try out more algorithms in the `Examples directory <https://github.com/opendilab/DI-engine/blob/main/ding/example>`_,
or continue reading the documentation to get a deeper understanding of DI-engine's `Algorithms <../02_algo/index_zh.html>`_, `System Design <../03_system/index_zh.html>`_ and `Best Practices <../04_best_practice/index_zh.html>`_.
File renamed without changes
Binary file added source/01_quickstart/images/train_dqn.gif
8 changes: 8 additions & 0 deletions source/01_quickstart/index.rst
@@ -0,0 +1,8 @@
Quickstart
============================

.. toctree::
    :maxdepth: 2

    installation
    first_rl_program
8 changes: 8 additions & 0 deletions source/01_quickstart/index_zh.rst
@@ -0,0 +1,8 @@
Quick Start
============================

.. toctree::
    :maxdepth: 2

    installation_zh
    first_rl_program_zh