Add docs for qlib.rl #1322

.. _rl:

========================================================================
Reinforcement Learning in Quantitative Trading
========================================================================
.. currentmodule:: qlib

Introduction
============
The Qlib Reinforcement Learning toolkit (QlibRL) is the RL platform for quantitative investment. It contains a full set of components that cover the entire lifecycle of an RL pipeline, including building the simulator of the market, shaping states & actions, training policies (strategies), and backtesting strategies in the simulated environment.

QlibRL is implemented on top of the Tianshou and Gym frameworks. The high-level structure of QlibRL is shown below:

.. image:: ../_static/img/qlib_rl_highlevel.png

Here, we briefly introduce each of the components in the figure.
Base Modules
============

EnvWrapper
------------
EnvWrapper is the complete encapsulation of the simulated environment. It receives actions from the outside (policy / strategy / agent), simulates the changes of the market, and then returns rewards and updated states, thus forming an interaction loop.
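To make the interaction loop concrete, here is a minimal sketch of the agent-environment loop that an EnvWrapper drives. Since EnvWrapper implements the ``gym.Env`` interface (see the next paragraph), the classic Gym loop applies. In this sketch a built-in Gym environment and a random action stand in for an EnvWrapper instance and a trained policy, and the snippet assumes the pre-0.26 Gym API (newer Gym/Gymnasium versions return slightly different tuples from ``reset`` and ``step``).

.. code-block:: python

    import gym

    env = gym.make("CartPole-v1")            # stand-in for an EnvWrapper instance
    obs = env.reset()                        # initial state from the environment
    done = False
    while not done:
        action = env.action_space.sample()   # stand-in for a policy's decision
        # one simulation step: the environment returns the updated state and a reward
        obs, reward, done, info = env.step(action)
    env.close()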

In QlibRL, EnvWrapper is a subclass of ``gym.Env``, so it implements all necessary interfaces of ``gym.Env``. Any classes or pipelines that accept ``gym.Env`` should also accept EnvWrapper. Developers do not need to implement their own EnvWrapper to build their own environment. Instead, they only need to implement 4 components of the EnvWrapper:

- `Simulator`
    The simulator is the core component responsible for the environment simulation. Developers could implement all the logic that is directly related to the environment simulation in the Simulator in any way they like. In QlibRL, there are already two implementations of Simulator for single-asset trading: 1) ``SingleAssetOrderExecution``, which is built on Qlib's backtest toolkits and hence considers a lot of practical trading details, but is slow; 2) ``SimpleSingleAssetOrderExecution``, which is built on a simplified trading simulator that ignores many details (e.g. trading limitations, rounding) but is quite fast.
- `State interpreter`
    The state interpreter is responsible for "interpreting" states from the original format (the format provided by the simulator) into a format that the policy can understand, for example, transforming unstructured raw features into numerical tensors.
- `Action interpreter`
    The action interpreter is similar to the state interpreter. But instead of states, it interprets actions generated by the policy, from the format provided by the policy into the format that is acceptable to the simulator.
- `Reward function`
    The reward function returns a numerical reward to the policy each time the policy takes an action.

EnvWrapper organically organizes these components. Such decomposition allows for better flexibility in development. For example, if developers want to train multiple types of policies in the same environment, they only need to design one simulator and design different state interpreters / action interpreters / reward functions for the different types of policies.

QlibRL has well-defined base classes for all of these 4 components. All the developers need to do is define their own components by inheriting the base classes and then implementing all interfaces required by the base classes, as sketched below.
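The following is a schematic sketch of this decomposition using plain Python classes. It is not the actual QlibRL base-class API (check the qlib.rl API reference for the real base classes and their exact interfaces); the class and method names here are only meant to illustrate how a simulator, the two interpreters, and a reward function divide the work.

.. code-block:: python

    import numpy as np

    class MySimulator:
        """Hypothetical simulator: tracks how much of an order remains to be executed."""

        def __init__(self, total_volume: float, num_steps: int):
            self.remaining = total_volume
            self.num_steps = num_steps
            self.step_count = 0

        def step(self, volume_to_trade: float) -> None:
            # advance the simulation by one step: execute part of the order
            self.remaining -= min(volume_to_trade, self.remaining)
            self.step_count += 1

        def done(self) -> bool:
            return self.remaining <= 0 or self.step_count >= self.num_steps

    class MyStateInterpreter:
        """Turns the simulator's raw state into a numerical observation for the policy."""

        def interpret(self, sim: MySimulator) -> np.ndarray:
            return np.array([sim.remaining, sim.num_steps - sim.step_count], dtype=np.float32)

    class MyActionInterpreter:
        """Turns the policy's abstract action (a fraction) into a simulator-level volume."""

        def interpret(self, sim: MySimulator, action: float) -> float:
            return float(action) * sim.remaining

    class MyReward:
        """Returns a numerical reward after each step (here: penalize leftover volume at the end)."""

        def reward(self, sim: MySimulator) -> float:
            return -sim.remaining if sim.done() else 0.0

    # EnvWrapper-style flow: interpret state -> pick action -> interpret action -> step -> reward
    sim = MySimulator(total_volume=1000.0, num_steps=10)
    obs = MyStateInterpreter().interpret(sim)                  # e.g. array([1000., 10.])
    sim.step(MyActionInterpreter().interpret(sim, action=0.1))
    print(MyReward().reward(sim))                              # 0.0 until the episode ends

In QlibRL these four pieces would instead inherit from the corresponding base classes, and EnvWrapper assembles them into a single ``gym.Env``-compatible environment.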

Policy
------------
QlibRL directly uses Tianshou's policy. Developers could use policies provided by Tianshou off the shelf, or implement their own policies by inheriting Tianshou's policies.

Training Vessel & Trainer
-------------------------
As stated by their names, training vessels and trainers are helper classes used in training. A training vessel is a ship that contains a simulator / interpreters / reward function / policy, and it controls algorithm-related parts of training. Correspondingly, the trainer is responsible for controlling the runtime parts of training.

As you may have noticed, a training vessel itself holds all the required components to build an EnvWrapper, rather than holding an instance of EnvWrapper directly. This allows the training vessel to create duplicates of EnvWrapper dynamically when necessary (for example, under parallel training).

With a training vessel, the trainer can finally launch the training pipeline through simple, Scikit-learn-like interfaces (i.e., `trainer.fit()`).
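The split can be sketched in plain Python as follows. The ``SketchVessel`` / ``SketchTrainer`` classes below are purely hypothetical stand-ins, not the QlibRL helper classes; they only illustrate why the vessel stores a simulator factory (so fresh environments can be built per worker) while the trainer owns the runtime loop. See the Example section below for the actual training entry point.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Any, Callable, Dict

    @dataclass
    class SketchVessel:
        """Schematic vessel: holds the *ingredients* of an environment (a simulator
        factory plus interpreters / reward / policy) rather than a live EnvWrapper."""
        simulator_fn: Callable[[], Any]
        state_interpreter: Any
        action_interpreter: Any
        reward: Any
        policy: Any

        def build_env(self) -> Dict[str, Any]:
            # a factory yields a fresh environment on every call, which is what
            # makes duplicating environments for parallel training possible
            return {
                "simulator": self.simulator_fn(),
                "state_interpreter": self.state_interpreter,
                "action_interpreter": self.action_interpreter,
                "reward": self.reward,
            }

    class SketchTrainer:
        """Schematic trainer: owns runtime concerns such as the number of iterations
        and of parallel environments, and drives the vessel through training."""

        def __init__(self, max_iters: int = 2, n_envs: int = 4):
            self.max_iters = max_iters
            self.n_envs = n_envs

        def fit(self, vessel: SketchVessel) -> None:
            for _ in range(self.max_iters):
                envs = [vessel.build_env() for _ in range(self.n_envs)]
                # ... collect experience in `envs` with vessel.policy, then update the policy ...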

Potential Application Scenarios
===============================

Portfolio Construction
----------------------
Portfolio construction is the process of selecting securities to achieve maximum returns while taking on minimum risk. With an RL-based solution, an agent allocates stocks at every time step by obtaining information about each stock and the market. The key is to develop a policy for building a portfolio and to make that policy able to pick the optimal portfolio. RL-based portfolio construction will be released in the future.

Order Execution
---------------
As a fundamental problem in algorithmic trading, order execution aims at fulfilling a specific trading order, either liquidation or acquirement, for a given instrument. Essentially, the goal of order execution is twofold: it not only requires fulfilling the whole order, but also targets a more economical execution that maximizes profit gain (or minimizes capital loss). Order execution with only one order, of liquidation or acquirement, is called single-asset order execution.

Since stock investment aims at long-term maximized profit, it usually takes the form of a sequential process of continuously adjusting the asset portfolio. Executing multiple orders, including orders of liquidation and acquirement, therefore brings more constraints, and the sequence in which the different orders are executed needs to be considered; e.g. before executing an order to buy some stocks, we may have to sell at least one stock first. Order execution with multiple assets is called multi-asset order execution.

Since order execution is a sequential decision-making problem by nature, an RL-based solution can be applied to it. With an RL-based solution, an agent optimizes its execution strategy by interacting with the market environment.
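To make the sequential-decision framing concrete, here is a small, self-contained illustration (not QlibRL code) of a single-asset liquidation episode: the state is the remaining volume and the time left, the action is how much of the remainder to trade at each step, and a natural reward is the achieved price advantage relative to the average market price. A TWAP-like rule stands in for the policy an RL agent would learn, and all numbers are synthetic.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)

    total_volume = 1000.0      # shares to liquidate over the trading horizon
    num_steps = 8              # decision points in the episode
    prices = 10.0 + rng.normal(0, 0.05, num_steps)   # synthetic per-step prices

    remaining = total_volume
    total_proceeds = 0.0
    for t in range(num_steps):
        # state: (remaining volume, steps left); action: fraction of the remainder to sell now
        steps_left = num_steps - t
        action = 1.0 / steps_left      # TWAP-like baseline; an RL policy would learn this instead
        volume = remaining * action
        total_proceeds += volume * prices[t]
        remaining -= volume

    # a natural reward signal: average execution price relative to the mean market price
    avg_exec_price = total_proceeds / total_volume
    price_advantage = avg_exec_price / prices.mean() - 1.0
    print(f"executed {total_volume - remaining:.0f} shares, price advantage: {price_advantage:+.4%}")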

With QlibRL, the RL algorithms for the above scenarios can be easily implemented.

Example
============
QlibRL provides a set of APIs for developers to further simplify their development. For example, if developers have already defined their simulator / interpreters / reward function / policy, they could launch the training pipeline by simply running:

.. code-block:: python

    # `partial` comes from the standard library; the remaining names
    # (train, SingleAssetOrderExecution, PAPenaltyReward, ConsoleWriter,
    # the interpreters, policy, orders and DATA_DIR) are assumed to have been
    # defined / imported from qlib.rl beforehand, as described above.
    from functools import partial

    train(
        # factory for the simulator, so fresh instances can be created on demand
        simulator_fn=partial(SingleAssetOrderExecution, data_dir=DATA_DIR, ticks_per_step=30),
        state_interpreter=state_interp,
        action_interpreter=action_interp,
        initial_states=orders,
        policy=policy,
        reward=PAPenaltyReward(),
        # algorithm-related settings, forwarded to the training vessel
        vessel_kwargs={
            "episode_per_iter": 100,
            "update_kwargs": {
                "batch_size": 64,
                "repeat": 5,
            },
        },
        # runtime settings, forwarded to the trainer
        trainer_kwargs={
            "max_iters": 2,
            "loggers": ConsoleWriter(total_episodes=100),
        },
    )

We demonstrate an example implementation of a single-asset order execution task based on QlibRL. The details of the example can be found `here <../../examples/rl/README.md>`_.