
Commit

Support live preview

sailxjx committed Dec 13, 2021
1 parent d30df3c commit e6b94de
Showing 13 changed files with 67 additions and 63 deletions.
3 changes: 3 additions & 0 deletions Makefile
@@ -5,6 +5,7 @@ DIAGRAMS := $(MAKE) -f "${DIAGRAMS_MK}"
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
+SPHINXLIVE = sphinx-autobuild
SOURCEDIR = source
BUILDDIR = build

@@ -18,6 +19,8 @@ diagrams:
html: diagrams
@$(SPHINXBUILD) -M html "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@$(SPHINXBUILD) -M html "$(SOURCEDIR)" "$(BUILDDIR)" ./source/**/*_zh.* $(SPHINXOPTS) $(O) -D master_doc=index_zh
+live: diagrams
+	@$(SPHINXLIVE) "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
build: html
clean:
@$(DIAGRAMS) clean
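
For context, the new ``live`` target just runs ``sphinx-autobuild`` over the source tree so the HTML is rebuilt and reloaded on every file change. A minimal Python sketch of the same behaviour (assuming the standard ``sphinx-autobuild SOURCEDIR OUTPUTDIR`` command line; the port mentioned in the comment is sphinx-autobuild's usual default, not something this Makefile sets) could look like:

.. code-block:: python

    import subprocess

    SOURCEDIR = "source"  # same values as the Makefile variables above
    BUILDDIR = "build"

    # Equivalent of `make live`: watch SOURCEDIR, rebuild into BUILDDIR and
    # serve the result with live reload (sphinx-autobuild typically serves on
    # http://127.0.0.1:8000 unless --host/--port are passed).
    subprocess.run(["sphinx-autobuild", SOURCEDIR, BUILDDIR], check=True)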
1 change: 1 addition & 0 deletions requirements.txt
@@ -4,4 +4,5 @@ sphinx_rtd_theme~=0.4.3
enum_tools
sphinx-toolbox
plantumlcli>=0.0.2
+sphinx-autobuild
git+http://github.com/opendilab/DI-engine@main
2 changes: 1 addition & 1 deletion source/env_tutorial/slime_volleyball_zh.rst
@@ -47,7 +47,7 @@ hub <https://hub.docker.com/repository/docker/opendilab/ding>`__\ for more

Space Before Transformation (Original Environment)
===================================================
-Note: ``SlimeVolley-v0`` is taken as the example here, since benchmarking the ``self-play`` family of algorithms naturally starts from the simplest setting. To use the other two environments, refer to the original repository and adapt them accordingly based on the `DI-engine API <https://di-engine-docs.readthedocs.io/en/main-zh/feature/env_overview_en.html>`_.
+Note: ``SlimeVolley-v0`` is taken as the example here, since benchmarking the ``self-play`` family of algorithms naturally starts from the simplest setting. To use the other two environments, refer to the original repository and adapt them accordingly based on the `DI-engine API <https://di-engine-docs.readthedocs.io/en/main-zh/feature/env_overview.html>`_.

.. _观察空间-1:

4 changes: 2 additions & 2 deletions source/hands_on/a2c.rst
@@ -73,7 +73,7 @@ The default config is defined as follows:
.. autoclass:: ding.policy.a2c.A2CPolicy
:noindex:

The network interface A2C used is defined as follows:

.. autoclass:: ding.model.template.vac.VAC
:members: __init__, forward, compute_actor, compute_critic, compute_actor_critic
@@ -94,7 +94,7 @@ The policy gradient and value update of A2C is implemented as follows:
value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
return a2c_loss(policy_loss, value_loss, entropy_loss)
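
The lines above are only a fragment of the hunk; as a rough, self-contained sketch of the same actor-critic update (generic PyTorch, not the exact ``ding`` helper functions, and the coefficient names in the comment are placeholders), the loss terms could be written as:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def a2c_losses(logit, action, value, return_, weight=None):
        """Policy-gradient, value-regression and entropy terms of A2C."""
        if weight is None:
            weight = torch.ones_like(value)
        dist = torch.distributions.Categorical(logits=logit)
        adv = (return_ - value).detach()  # advantage, detached from the critic
        policy_loss = -(dist.log_prob(action) * adv * weight).mean()
        value_loss = (F.mse_loss(return_, value, reduction='none') * weight).mean()
        entropy_loss = dist.entropy().mean()
        # total objective: policy_loss + value_weight * value_loss - entropy_weight * entropy_loss
        return policy_loss, value_loss, entropy_loss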
-The Benchmark results of A2C implemented in DI-engine can be found in `Benchmark <../feature/algorithm_overview_en.html>`_.
+The Benchmark results of A2C implemented in DI-engine can be found in `Benchmark <../feature/algorithm_overview.html>`_.

References
-----------
6 changes: 3 additions & 3 deletions source/hands_on/c51_qrdqn_iqn.rst
@@ -63,7 +63,7 @@ The network interface C51 used is defined as follows:

The Bellman updates of C51 are implemented as:

-The Benchmark result of C51 implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of C51 implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_

QRDQN
^^^^^^^
@@ -131,7 +131,7 @@ The network interface QRDQN used is defined as follows:

The Bellman updates of QRDQN are implemented in the function ``qrdqn_nstep_td_error`` of ``ding/rl_utils/td.py``.

-The Benchmark result of QRDQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of QRDQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_

IQN
^^^^^^^
@@ -200,7 +200,7 @@ The network interface IQN used is defined as follows:

The Bellman updates of IQN are defined in the function ``iqn_nstep_td_error`` of ``ding/rl_utils/td.py``.

-The Benchmark result of IQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of IQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_


References
4 changes: 2 additions & 2 deletions source/hands_on/ddpg.rst
@@ -242,7 +242,7 @@ We configure ``learn.target_theta`` to control the interpolation factor in avera
update_kwargs={'theta': self._cfg.learn.target_theta}
)
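
For reference, the ``theta`` above acts as the interpolation factor of an exponential moving average over the learner parameters. A standalone sketch of such a soft update (a hypothetical helper, not the ``ding`` target-model wrapper) might be:

.. code-block:: python

    import torch

    def soft_update(target_net, online_net, theta=0.005):
        """Polyak averaging: target <- theta * online + (1 - theta) * target."""
        with torch.no_grad():
            for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
                t_param.mul_(1.0 - theta).add_(theta * o_param)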
-The Benchmark result of DDPG implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of DDPG implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_

Other Public Implementations
----------------------------
@@ -264,4 +264,4 @@ Other Public Implementations

References
-----------
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra: “Continuous control with deep reinforcement learning”, 2015; [http://arxiv.org/abs/1509.02971 arXiv:1509.02971].
4 changes: 2 additions & 2 deletions source/hands_on/dqn.rst
@@ -66,7 +66,7 @@ DQN can be combined with:

- Dueling head

In `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>`_, the dueling head architecture is utilized to decompose the Q-value into a state-value and an advantage for each action, and these two parts are used to construct the final q_value, which is better for evaluating the value of states in which not all actions can be sampled.
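
As a rough illustration of that decomposition (generic PyTorch with a mean-subtracted advantage, layer sizes chosen arbitrarily), a dueling head might look like the sketch below; the architecture diagram referenced next shows the original design.

.. code-block:: python

    import torch
    import torch.nn as nn

    class DuelingHead(nn.Module):
        """Split a shared feature into state-value V(s) and advantages A(s, a)."""

        def __init__(self, feature_dim: int, action_dim: int):
            super().__init__()
            self.value = nn.Linear(feature_dim, 1)
            self.advantage = nn.Linear(feature_dim, action_dim)

        def forward(self, feature: torch.Tensor) -> torch.Tensor:
            v = self.value(feature)          # (B, 1)
            a = self.advantage(feature)      # (B, num_actions)
            # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
            return v + a - a.mean(dim=-1, keepdim=True)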

The specific architecture is shown in the following graph:

@@ -89,7 +89,7 @@ The network interface DQN used is defined as follows:
:members: __init__, forward
:noindex:

-The Benchmark result of DQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of DQN implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_


Reference
2 changes: 1 addition & 1 deletion source/hands_on/impala.rst
@@ -322,7 +322,7 @@ The network interface IMPALA used is defined as follows:
:members: __init__, forward
:noindex:

-The Benchmark result of IMPALA implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of IMPALA implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_


Reference
2 changes: 1 addition & 1 deletion source/hands_on/ppo.rst
@@ -91,7 +91,7 @@ The policy gradient and value update of PPO is implemented as follows:
return ppo_loss(policy_output.policy_loss, value_loss, policy_output.entropy_loss), policy_info
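
Only the return statement of the hunk is visible above; a generic sketch of the clipped-surrogate policy loss and value loss behind such an update (``clip_ratio`` is an assumed hyper-parameter, not necessarily the DI-engine default) could be:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def ppo_policy_loss(logp_new, logp_old, adv, clip_ratio=0.2):
        """Clipped surrogate objective from the PPO paper."""
        ratio = torch.exp(logp_new - logp_old)
        surr1 = ratio * adv
        surr2 = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv
        return -torch.min(surr1, surr2).mean()

    def ppo_value_loss(value, return_):
        return F.mse_loss(value, return_)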
-The Benchmark result of PPO implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_.
+The Benchmark result of PPO implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_.


References
22 changes: 11 additions & 11 deletions source/hands_on/rainbow.rst
@@ -32,9 +32,9 @@ Prioritized Experience Replay(PER)
DQN samples uniformly from the replay buffer. Ideally, we want to sample more frequently those transitions from which there is much to learn. As a proxy for learning potential, prioritized experience replay samples transitions with probability proportional to the last encountered absolute TD error, formally:

.. math::
p_{t} \propto\left|R_{t+1}+\gamma_{t+1} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+1}, a^{\prime}\right)-q_{\theta}\left(S_{t}, A_{t}\right)\right|^{\omega}
In the original paper of PER, the authors show that PER achieves improvements on most of the 57 Atari games, especially on Gopher, Atlantis, James Bond 007, Space Invaders, etc.
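
A minimal sketch of that proportional sampling rule (flat NumPy arrays instead of the sum-tree a real buffer would use; the ``omega`` and ``beta`` defaults are only illustrative) might be:

.. code-block:: python

    import numpy as np

    def sample_prioritized(td_errors, batch_size, omega=0.6, beta=0.4, eps=1e-6):
        """Sample indices with probability proportional to |TD error|^omega."""
        priorities = (np.abs(td_errors) + eps) ** omega
        probs = priorities / priorities.sum()
        idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
        # importance-sampling weights that correct for the non-uniform sampling
        weights = (len(td_errors) * probs[idx]) ** (-beta)
        return idx, weights / weights.max()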

@@ -46,7 +46,7 @@ streams, sharing a convolutional encoder, and merged by a special aggregator. Th
.. math::
q_{\theta}(s, a)=v_{\eta}\left(f_{\xi}(s)\right)+a_{\psi}\left(f_{\xi}(s), a\right)-\frac{\sum_{a^{\prime}} a_{\psi}\left(f_{\xi}(s), a^{\prime}\right)}{N_{\text {actions }}}
The network architecture of Rainbow is a dueling network architecture adapted for use with return distributions. The network has a shared representation, which is then fed into a value stream :math:`v_\eta` with :math:`N_{atoms}` outputs, and into an advantage stream :math:`a_{\psi}` with :math:`N_{atoms} \times N_{actions}` outputs, where :math:`a_{\psi}^i(a)` will denote the output corresponding to atom i and action a. For each atom :math:`z_i`, the value and advantage streams are aggregated, as in dueling DQN, and then passed through a softmax layer to obtain the normalized parametric distributions used to estimate the returns’ distributions:

.. math::
@@ -60,7 +60,7 @@ Multi-step Learning
-------------------
A multi-step variant of DQN is then defined by minimizing the alternative loss:


.. math::
\left(R_{t}^{(n)}+\gamma_{t}^{(n)} \max _{a^{\prime}} q_{\bar{\theta}}\left(S_{t+n}, a^{\prime}\right)-q_{\theta}\left(S_{t}, A_{t}\right)\right)^{2}
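
A small sketch of the n-step target inside that loss (single transition, plain tensors, with termination handling kept to the obvious case) could be:

.. code-block:: python

    import torch

    def nstep_target(rewards, gamma, q_next_max, done):
        """Compute R_t^{(n)} + gamma^n * max_a' Q_target(S_{t+n}, a').

        rewards:    tensor of shape (n,) holding r_{t+1} ... r_{t+n}
        q_next_max: scalar tensor, max_a' Q_target(S_{t+n}, a')
        done:       True if the episode terminated within these n steps
        """
        n = rewards.shape[0]
        discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
        target = (discounts * rewards).sum()
        if not done:
            target = target + (gamma ** n) * q_next_max
        return target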
@@ -79,7 +79,7 @@ Noisy Net
Noisy Nets use a noisy linear layer that combines a deterministic and noisy stream:

.. math::
\boldsymbol{y}=(\boldsymbol{b}+\mathbf{W} \boldsymbol{x})+\left(\boldsymbol{b}_{\text {noisy }} \odot \epsilon^{b}+\left(\mathbf{W}_{\text {noisy }} \odot \epsilon^{w}\right) \boldsymbol{x}\right)
Over time, the network can learn to ignore the noisy stream, but at different rates in different parts of the state space, allowing state-conditional exploration with a form of self-annealing. It usually achieves improvements over epsilon-greedy when the action space is large, e.g. Montezuma's Revenge, because epsilon-greedy tends to converge quickly to a one-hot distribution before the rewards of the large number of actions have been sufficiently collected.
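
A compact sketch of such a noisy linear layer (independent Gaussian noise rather than the factorised variant used in the paper; the initial noise scale is an assumption) might look like:

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyLinear(nn.Module):
        """y = (W + W_noisy * eps_w) x + (b + b_noisy * eps_b)."""

        def __init__(self, in_features: int, out_features: int, sigma0: float = 0.4):
            super().__init__()
            init_sigma = sigma0 / in_features ** 0.5
            self.weight = nn.Parameter(torch.empty(out_features, in_features))
            self.bias = nn.Parameter(torch.zeros(out_features))
            self.weight_noisy = nn.Parameter(torch.full((out_features, in_features), init_sigma))
            self.bias_noisy = nn.Parameter(torch.full((out_features,), init_sigma))
            nn.init.kaiming_uniform_(self.weight)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            eps_w = torch.randn_like(self.weight)  # noise is resampled every forward pass
            eps_b = torch.randn_like(self.bias)
            return F.linear(x, self.weight + self.weight_noisy * eps_w,
                            self.bias + self.bias_noisy * eps_b)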
@@ -103,14 +103,14 @@ The network interface Rainbow used is defined as follows:
:members: __init__, forward
:noindex:

-The Benchmark result of Rainbow implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview_en.html>`_
+The Benchmark result of Rainbow implemented in DI-engine is shown in `Benchmark <../feature/algorithm_overview.html>`_



Experiments on Rainbow Tricks
-----------------------------
We conduct experiments on the lunarlander environment using the rainbow (dqn) policy to compare the performance of the n-step, dueling, priority, and priority_IS tricks with the baseline. The code for the experiments is `here <https://github.com/opendilab/DI-engine/blob/main/dizoo/box2d/lunarlander/config/lunarlander_dqn_config.py>`_.
Note that the config file is set for ``dqn`` by default. If we want to adopt the ``rainbow`` policy, we need to change the
type of policy as below.

.. code-block:: python
@@ -125,7 +125,7 @@ type of policy as below.
policy=dict(type='rainbow'),
)
The detailed experiment settings are stated below.

+---------------------+---------------------------------------------------------------------------------------------------+
@@ -146,7 +146,7 @@ The detailed experiment settings are stated below.


1. ``reward_mean`` over ``training iteration`` is used as an evaluation metric.

2. Each experiment setting is run three times, with random seeds 0, 1 and 2, and the results are averaged to reduce the effect of stochasticity.

.. code-block:: python
@@ -170,9 +170,9 @@ The detailed experiment settings are stated below.
The result is shown in the figure below. As we can see, with the tricks turned on, the speed of convergence increases by a large amount. In this experiment setting, the dueling trick contributes the most to the performance.
.. image::
images/rainbow_exp.png
:align: center
8 changes: 4 additions & 4 deletions source/index.rst
@@ -11,7 +11,7 @@ Overview
------------
DI-engine is a generalized Decision Intelligence engine. It supports most of the basic deep reinforcement learning (DRL) algorithms,
such as DQN, PPO, SAC, and domain-specific algorithms like QMIX in multi-agent RL, GAIL in inverse RL, and RND in exploration problems.
-An introduction to all the supported algorithms can be found in `Algorithm <./feature/algorithm_overview_en.html>`_.
+An introduction to all the supported algorithms can be found in `Algorithm <./feature/algorithm_overview.html>`_.

For scalability, DI-engine supports three different training pipelines:

@@ -36,17 +36,17 @@ For scalability, DI-engine supports three different training pipelines:
Main Features
--------------

-* DI-zoo: High performance DRL algorithm zoo, algorithm support list. `Link <feature/algorithm_overview_en.html>`_
+* DI-zoo: High performance DRL algorithm zoo, algorithm support list. `Link <feature/algorithm_overview.html>`_
* Generalized decision intelligence algorithms: DRL family, IRL family, MARL family, searching family(MCTS) and etc.
* Customized DRL demand implementation, such as Inverse RL/RL hybrid training; Multi-buffer training; League self-play training
* Large scale DRL training demonstration and application
* Various efficiency optimization modules: DI-hpc, DI-store, EnvManager, DataLoader
* k8s support, DI-orchestrator k8s cluster scheduler for dynamic collectors and other services


To get started, take a look at the `quick start <./quick_start/index.html>`_ and the `API documentation <./api_doc/index.html>`_.
For RL beginners, DI-engine advises you to refer to `hands-on RL <hands_on/index.html>`_ for more discussion.
If you want to deeply customize your algorithm and application with DI-engine, also check out `key concept <./key_concept/index.html>`_ and `Feature <./feature/index.html>`_.

.. toctree::
:maxdepth: 2
4 changes: 2 additions & 2 deletions source/index_zh.rst
@@ -10,7 +10,7 @@
Overview
------------
DI-engine is a generalized decision intelligence platform. It supports most of the commonly used deep reinforcement learning algorithms, such as DQN, PPO and SAC, as well as algorithms from related research subfields, e.g. QMIX in multi-agent RL,
-GAIL in inverse RL, and RND in exploration problems. An introduction to all currently supported algorithms and their performance can be found in `Algorithm Overview <./feature/algorithm_overview_en.html>`_
+GAIL in inverse RL, and RND in exploration problems. An introduction to all currently supported algorithms and their performance can be found in `Algorithm Overview <./feature/algorithm_overview.html>`_

For generality and scalability at various computing scales, DI-engine supports three different training modes:

@@ -35,7 +35,7 @@ DI-engine is a generalized decision intelligence platform. It supports most of the commonly used deep
Main Features
--------------

-* DI-zoo: a high-performance deep reinforcement learning algorithm zoo; see the `link <feature/algorithm_overview_en.html>`_ for details
+* DI-zoo: a high-performance deep reinforcement learning algorithm zoo; see the `link <feature/algorithm_overview.html>`_ for details
* The widest range of decision AI algorithm implementations: the deep RL family, the inverse RL family, the multi-agent RL family, search-based algorithms (e.g. Monte Carlo Tree Search), and more
* Support for various customized algorithm implementations, such as RL/inverse-RL hybrid training, multi-buffer training, and league self-play training
* Support for large-scale deep reinforcement learning training and evaluation