
Commit

Support live preview

sailxjx committed Dec 13, 2021
1 parent d30df3c commit e6b94de
Showing 13 changed files with 67 additions and 63 deletions.
3 changes: 3 additions & 0 deletions Makefile
@@ -5,6 +5,7 @@ DIAGRAMS := $(MAKE) -f "${DIAGRAMS_MK}"
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
+SPHINXLIVE = sphinx-autobuild
SOURCEDIR = source
BUILDDIR = build

@@ -18,6 +19,8 @@ diagrams:
html: diagrams
@$(SPHINXBUILD) -M html "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@$(SPHINXBUILD) -M html "$(SOURCEDIR)" "$(BUILDDIR)" ./source/**/*_zh.* $(SPHINXOPTS) $(O) -D master_doc=index_zh
+live: diagrams
+	@$(SPHINXLIVE) "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
build: html
clean:
@$(DIAGRAMS) clean
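
For context, the new ``live`` target just runs ``sphinx-autobuild`` over the source tree so the HTML is rebuilt and reloaded on every file change. A minimal Python sketch of the same behaviour (assuming the standard ``sphinx-autobuild SOURCEDIR OUTPUTDIR`` command line; the port mentioned in the comment is sphinx-autobuild's usual default, not something this Makefile sets) could look like:

.. code-block:: python

    import subprocess

    SOURCEDIR = "source"  # same values as the Makefile variables above
    BUILDDIR = "build"

    # Equivalent of `make live`: watch SOURCEDIR, rebuild into BUILDDIR and
    # serve the result with live reload (sphinx-autobuild typically serves on
    # http://127.0.0.1:8000 unless --host/--port are passed).
    subprocess.run(["sphinx-autobuild", SOURCEDIR, BUILDDIR], check=True)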
1 change: 1 addition & 0 deletions requirements.txt
@@ -4,4 +4,5 @@ sphinx_rtd_theme~=0.4.3
enum_tools
sphinx-toolbox
plantumlcli>=0.0.2
+sphinx-autobuild
git+http://github.com/opendilab/DI-engine@main
2 changes: 1 addition & 1 deletion source/env_tutorial/slime_volleyball_zh.rst
@@ -47,7 +47,7 @@ hub <https://hub.docker.com/repository/docker/opendilab/ding>`__\ for more

Space Before Transformation (Original Environment)
===================================================
-Note: ``SlimeVolley-v0`` is taken as the example here, since benchmarking the ``self-play`` family of algorithms naturally starts from the simplest setting. To use the other two environments, refer to the original repository and adapt them accordingly based on the `DI-engine API <https://di-engine-docs.readthedocs.io/en/main-zh/feature/env_overview_en.html>`_.
+Note: ``SlimeVolley-v0`` is taken as the example here, since benchmarking the ``self-play`` family of algorithms naturally starts from the simplest setting. To use the other two environments, refer to the original repository and adapt them accordingly based on the `DI-engine API <https://di-engine-docs.readthedocs.io/en/main-zh/feature/env_overview.html>`_.

.. _观察空间-1:

4 changes: 2 additions & 2 deletions source/hands_on/a2c.rst
@@ -73,7 +73,7 @@ The default config is defined as follows:
.. autoclass:: ding.policy.a2c.A2CPolicy
:noindex:

The network interface A2C used is defined as follows:

.. autoclass:: ding.model.template.vac.VAC
:members: __init__, forward, compute_actor, compute_critic, compute_actor_critic
@@ -94,7 +94,7 @@ The policy gradient and value update of A2C is implemented as follows:
value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
return a2c_loss(policy_loss, value_loss, entropy_loss)
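
The lines above are only a fragment of the hunk; as a rough, self-contained sketch of the same actor-critic update (generic PyTorch, not the exact ``ding`` helper functions, and the coefficient names in the comment are placeholders), the loss terms could be written as:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def a2c_losses(logit, action, value, return_, weight=None):
        """Policy-gradient, value-regression and entropy terms of A2C."""
        if weight is None:
            weight = torch.ones_like(value)
        dist = torch.distributions.Categorical(logits=logit)
        adv = (return_ - value).detach()  # advantage, detached from the critic
        policy_loss = -(dist.log_prob(action) * adv * weight).mean()
        value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
        entropy_loss = dist.entropy().mean()
        # total objective: policy_loss + value_weight * value_loss - entropy_weight * entropy_loss
        return policy_loss, value_loss, entropy_loss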
-The Benchmark results of A2C implemented in DI-engine can be found in `Benchmark <../feature/algorithm_overview_en.html>`_.
+The Benchmark results of A2C implemented in DI-engine can be found in `Benchmark <../feature/algorithm_overview.html>`_.

References
-----------
6 changes: 3 additions & 3 deletions source/hands_on/c51_qrdqn_iqn.rst
@@ -63,7 +63,7 @@ The network interface C51 used is defined as follows:

The Bellman updates of C51 are implemented as:

-The Benchmark result of C51 implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of C51 implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_

QRDQN
^^^^^^^
@@ -131,7 +131,7 @@ The network interface QRDQN used is defined as follows:

The Bellman updates of QRDQN are implemented in the function ``qrdqn_nstep_td_error`` of ``ding/rl_utils/td.py``.

-The Benchmark result of QRDQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of QRDQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_

IQN
^^^^^^^
@@ -200,7 +200,7 @@ The network interface IQN used is defined as follows:

The Bellman updates of IQN are defined in the function ``iqn_nstep_td_error`` of ``ding/rl_utils/td.py``.

-The Benchmark result of IQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of IQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_


References
4 changes: 2 additions & 2 deletions source/hands_on/ddpg.rst
@@ -242,7 +242,7 @@ We configure ``learn.target_theta`` to control the interpolation factor in avera
update_kwargs={'theta': self._cfg.learn.target_theta}
)
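
For reference, the ``theta`` above acts as the interpolation factor of an exponential moving average over the learner parameters. A standalone sketch of such a soft update (a hypothetical helper, not the ``ding`` target-model wrapper) might be:

.. code-block:: python

    import torch

    def soft_update(target_net, online_net, theta=0.005):
        """Polyak averaging: target <- theta * online + (1 - theta) * target."""
        with torch.no_grad():
            for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
                t_param.mul_(1.0 - theta).add_(theta * o_param)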
-The Benchmark result of DDPG implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of DDPG implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_

Other Public Implementations
----------------------------
@@ -264,4 +264,4 @@ Other Public Implementations

References
-----------
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra: “Continuous control with deep reinforcement learning”, 2015; [http://arxiv.org/abs/1509.02971 arXiv:1509.02971].
4 changes: 2 additions & 2 deletions source/hands_on/dqn.rst
@@ -66,7 +66,7 @@ DQN can be combined with:

- Dueling head

In `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_, the dueling head architecture is utilized to decompose the Q-value into a state-value and an advantage for each action, and these two parts are used to construct the final q_value, which is better for evaluating the value of states in which not all actions can be sampled.
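
As a rough illustration of that decomposition (generic PyTorch with a mean-subtracted advantage, layer sizes chosen arbitrarily), a dueling head might look like the sketch below; the architecture diagram referenced next shows the original design.

.. code-block:: python

    import torch
    import torch.nn as nn

    class DuelingHead(nn.Module):
        """Split a shared feature into state-value V(s) and advantages A(s, a)."""

        def __init__(self, feature_dim: int, action_dim: int):
            super().__init__()
            self.value = nn.Linear(feature_dim, 1)
            self.advantage = nn.Linear(feature_dim, action_dim)

        def forward(self, feature: torch.Tensor) -> torch.Tensor:
            v = self.value(feature)          # (B, 1)
            a = self.advantage(feature)      # (B, num_actions)
            # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
            return v + a - a.mean(dim=-1, keepdim=True)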

The specific architecture is shown in the following graph:

@@ -89,7 +89,7 @@ The network interface DQN used is defined as follows:
:members: __init__, forward
:noindex:

-The Benchmark result of DQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of DQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_


Reference
2 changes: 1 addition & 1 deletion source/hands_on/impala.rst
@@ -322,7 +322,7 @@ The network interface IMPALA used is defined as follows:
:members: __init__, forward
:noindex:

-The Benchmark result of IMPALA implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of IMPALA implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_


Reference
2 changes: 1 addition & 1 deletion source/hands_on/ppo.rst
@@ -91,7 +91,7 @@ The policy gradient and value update of PPO is implemented as follows:
return ppo_loss(policy_output.policy_loss, value_loss, policy_output.entropy_loss), policy_info
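
Only the return statement of the hunk is visible above; a generic sketch of the clipped-surrogate policy loss and value loss behind such an update (``clip_ratio`` is an assumed hyper-parameter, not necessarily the DI-engine default) could be:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def ppo_policy_loss(logp_new, logp_old, adv, clip_ratio=0.2):
        """Clipped surrogate objective from the PPO paper."""
        ratio = torch.exp(logp_new - logp_old)
        surr1 = ratio * adv
        surr2 = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv
        return -torch.min(surr1, surr2).mean()

    def ppo_value_loss(value, return_):
        return F.mse_loss(value, return_)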
-The Benchmark result of PPO implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_.
+The Benchmark result of PPO implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_.


References
22 changes: 11 additions & 11 deletions source/hands_on/rainbow.rst
@@ -32,9 +32,9 @@ Prioritized Experience Replay(PER)
DQN samples uniformly from the replay buffer. Ideally, we want to sample more frequently those transitions from which there is much to learn. As a proxy for learning potential, prioritized experience replay samples transitions with probability proportional to the last encountered absolute TD error, formally:

.. math::
p_{t} \propto\left|R_{t+1}+\gamma_{t+1} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+1}, a^{\prime}\right)-q_{\theta}\left(S_{t}, A_{t}\right)\right|^{\omega}
In the original paper of PER, the authors show that PER achieves improvements on most of the 57 Atari games, especially on Gopher, Atlantis, James Bond 007, Space Invaders, etc.
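
A minimal sketch of that proportional sampling rule (flat NumPy arrays instead of the sum-tree a real buffer would use; the ``omega`` and ``beta`` defaults are only illustrative) might be:

.. code-block:: python

    import numpy as np

    def sample_prioritized(td_errors, batch_size, omega=0.6, beta=0.4, eps=1e-6):
        """Sample indices with probability proportional to |TD error|^omega."""
        priorities = (np.abs(td_errors) + eps) ** omega
        probs = priorities / priorities.sum()
        idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
        # importance-sampling weights that correct for the non-uniform sampling
        weights = (len(td_errors) * probs[idx]) ** (-beta)
        return idx, weights / weights.max()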

@@ -46,7 +46,7 @@ streams, sharing a convolutional encoder, and merged by a special aggregator. Th
.. math::
q_{\theta}(s, a)=v_{\eta}\left(f_{\xi}(s)\right)+a_{\psi}\left(f_{\xi}(s), a\right)-\frac{\sum_{a^{\prime}} a_{\psi}\left(f_{\xi}(s), a^{\prime}\right)}{N_{\text {actions }}}
The network architecture of Rainbow is a dueling network architecture adapted for use with return distributions. The network has a shared representation, which is then fed into a value stream :math:`v_\eta` with :math:`N_{atoms}` outputs, and into an advantage stream :math:`a_{\psi}` with :math:`N_{atoms} \times N_{actions}` outputs, where :math:`a_{\psi}^i(a)` will denote the output corresponding to atom i and action a. For each atom :math:`z_i`, the value and advantage streams are aggregated, as in dueling DQN, and then passed through a softmax layer to obtain the normalized parametric distributions used to estimate the returns’ distributions:

.. math::
@@ -60,7 +60,7 @@ Multi-step Learning
-------------------
A multi-step variant of DQN is then defined by minimizing the alternative loss:


.. math::
\left(R_{t}^{(n)}+\gamma_{t}^{(n)} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+n}, a^{\prime}\right)-q_{\theta}\left(S_{t}, A_{t}\right)\right)^{2}
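
A small sketch of the n-step target inside that loss (single transition, plain tensors, with termination handling kept to the obvious case) could be:

.. code-block:: python

    import torch

    def nstep_target(rewards, gamma, q_next_max, done):
        """Compute R_t^{(n)} + gamma^n * max_a' Q_target(S_{t+n}, a').

        rewards:    tensor of shape (n,) holding r_{t+1} ... r_{t+n}
        q_next_max: scalar tensor, max_a' Q_target(S_{t+n}, a')
        done:       True if the episode terminated within these n steps
        """
        n = rewards.shape[0]
        discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
        target = (discounts * rewards).sum()
        if not done:
            target = target + (gamma ** n) * q_next_max
        return target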
@@ -79,7 +79,7 @@ Noisy Net
Noisy Nets use a noisy linear layer that combines a deterministic and noisy stream:

.. math::
\boldsymbol{y}=(\boldsymbol{b}+\mathbf{W} \boldsymbol{x})+\left(\boldsymbol{b}_{\text {noisy }} \odot \epsilon^{b}+\left(\mathbf{W}_{\text {noisy }} \odot \epsilon^{w}\right) \boldsymbol{x}\right)
Over time, the network can learn to ignore the noisy stream, but at different rates in different parts of the state space, allowing state-conditional exploration with a form of self-annealing. It usually achieves improvements over epsilon-greedy when the action space is large, e.g. Montezuma's Revenge, because epsilon-greedy tends to converge quickly to a one-hot distribution before the rewards of the large number of actions have been sufficiently collected.
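
A compact sketch of such a noisy linear layer (independent Gaussian noise rather than the factorised variant used in the paper; the initial noise scale is an assumption) might look like:

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyLinear(nn.Module):
        """y = (W + W_noisy * eps_w) x + (b + b_noisy * eps_b)."""

        def __init__(self, in_features: int, out_features: int, sigma0: float = 0.4):
            super().__init__()
            init_sigma = sigma0 / in_features ** 0.5
            self.weight = nn.Parameter(torch.empty(out_features, in_features))
            self.bias = nn.Parameter(torch.zeros(out_features))
            self.weight_noisy = nn.Parameter(torch.full((out_features, in_features), init_sigma))
            self.bias_noisy = nn.Parameter(torch.full((out_features,), init_sigma))
            nn.init.kaiming_uniform_(self.weight)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            eps_w = torch.randn_like(self.weight)  # noise is resampled every forward pass
            eps_b = torch.randn_like(self.bias)
            return F.linear(x, self.weight + self.weight_noisy * eps_w,
                            self.bias + self.bias_noisy * eps_b)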
@@ -103,14 +103,14 @@ The network interface Rainbow used is defined as follows:
:members: __init__, forward
:noindex:

-The Benchmark result of Rainbow implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of Rainbow implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_



Experiments on Rainbow Tricks
-----------------------------
We conduct experiments on the lunarlander environment using the rainbow (dqn) policy to compare the performance of the n-step, dueling, priority, and priority_IS tricks with the baseline. The code for the experiments is `here <https://github.com/opendilab/DI-engine/blob/main/dizoo/box2d/lunarlander/config/lunarlander_dqn_config.py>`_.
Note that the config file is set for ``dqn`` by default. If we want to adopt the ``rainbow`` policy, we need to change the
type of policy as below.

.. code-block:: python
@@ -125,7 +125,7 @@ type of policy as below.
policy=dict(type='rainbow'),
)
The detailed experiment settings are stated below.

+---------------------+---------------------------------------------------------------------------------------------------+
@@ -146,7 +146,7 @@ The detailed experiment settings are stated below.


1. ``reward_mean`` over ``training iteration`` is used as an evaluation metric.

2. Each experiment setting is run three times, with random seeds 0, 1 and 2, and the results are averaged to reduce the effect of stochasticity.

.. code-block:: python
@@ -170,9 +170,9 @@ The detailed experiment settings are stated below.
The result is shown in the figure below. As we can see, with the tricks turned on, the speed of convergence increases by a large amount. In this experiment setting, the dueling trick contributes the most to the performance.
.. image::
images/rainbow_exp.png
:align: center
8 changes: 4 additions & 4 deletions source/index.rst
@@ -11,7 +11,7 @@ Overview
------------
DI-engine is a generalized Decision Intelligence engine. It supports most of the basic deep reinforcement learning (DRL) algorithms,
such as DQN, PPO, SAC, and domain-specific algorithms like QMIX in multi-agent RL, GAIL in inverse RL, and RND in exploration problems.
-An introduction to all the supported algorithms can be found in `Algorithm <./feature/algorithm_overview_en.html>`_.
+An introduction to all the supported algorithms can be found in `Algorithm <./feature/algorithm_overview.html>`_.

For scalability, DI-engine supports three different training pipelines:

@@ -36,17 +36,17 @@ For scalability, DI-engine supports three different training pipelines:
Main Features
--------------

-* DI-zoo: High performance DRL algorithm zoo, algorithm support list. `Link <feature/algorithm_overview_en.html>`_
+* DI-zoo: High performance DRL algorithm zoo, algorithm support list. `Link <feature/algorithm_overview.html>`_
* Generalized decision intelligence algorithms: DRL family, IRL family, MARL family, searching family(MCTS) and etc.
* Customized DRL demand implementation, such as Inverse RL/RL hybrid training; Multi-buffer training; League self-play training
* Large scale DRL training demonstration and application
* Various efficiency optimization modules: DI-hpc, DI-store, EnvManager, DataLoader
* k8s support, DI-orchestrator k8s cluster scheduler for dynamic collectors and other services


To get started, take a look at the `quick start <./quick_start/index.html>`_ and the `API documentation <./api_doc/index.html>`_.
For RL beginners, DI-engine advises you to refer to `hands-on RL <hands_on/index.html>`_ for more discussion.
If you want to deeply customize your algorithm and application with DI-engine, also check out `key concept <./key_concept/index.html>`_ and `Feature <./feature/index.html>`_.

.. toctree::
:maxdepth: 2
4 changes: 2 additions & 2 deletions source/index_zh.rst
@@ -10,7 +10,7 @@
Overview
------------
DI-engine is a generalized decision intelligence platform. It supports most of the commonly used deep reinforcement learning algorithms, such as DQN, PPO and SAC, as well as algorithms from related research subfields, e.g. QMIX in multi-agent RL,
-GAIL in inverse RL, and RND in exploration problems. An introduction to all currently supported algorithms and their performance can be found in `Algorithm Overview <./feature/algorithm_overview_en.html>`_
+GAIL in inverse RL, and RND in exploration problems. An introduction to all currently supported algorithms and their performance can be found in `Algorithm Overview <./feature/algorithm_overview.html>`_

For generality and scalability at various computing scales, DI-engine supports three different training modes:

@@ -35,7 +35,7 @@ DI-engine is a generalized decision intelligence platform. It supports most of the commonly used deep
Main Features
--------------

-* DI-zoo: a high-performance deep reinforcement learning algorithm zoo; see the `link <feature/algorithm_overview_en.html>`_ for details
+* DI-zoo: a high-performance deep reinforcement learning algorithm zoo; see the `link <feature/algorithm_overview.html>`_ for details
* The widest range of decision AI algorithm implementations: the deep RL family, the inverse RL family, the multi-agent RL family, search-based algorithms (e.g. Monte Carlo Tree Search), and more
* Support for various customized algorithm implementations, such as RL/inverse-RL hybrid training, multi-buffer training, and league self-play training
* Support for large-scale deep reinforcement learning training and evaluation