Mountain Car (OpenAI gym) - Double Deep Q Network with Prioritized Experience Replay
- Python 3.6.4
$ pip install -r requirements.txt
$ python main.py [-h] [-i ITERATION] [-m MEMORYSIZE] [-b BATCHSIZE] [-lr LEARNINGRATE] [-hu HUMANEXP] [-hu--out HUMANEXPOUT] [-score--out SCOREOUT] [-screen SCREEN]
Options | Description |
---|---|
-h, --help | show this help message and exit |
-i ITERATION | input the iteration of training |
-m MEMORYSIZE | input the size of memory |
-b BATCHSIZE | input the size of batch |
-lr LEARNINGRATE | input learning rate |
-hu HUMANEXP | input human experience |
-hu--out HUMANEXPOUT | human experience output path |
-score--out SCOREOUT | score output path |
-screen SCREEN | show the screen of game (true/false) |
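For example, a run with the hyperparameters used in the experiments below (the argument values here are only illustrative):

$ python main.py -i 1000 -m 10000 -b 32 -lr 0.0005 -screen false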
- Double Deep Q Network With Prioritized Experience Replay
According to OpenAI gym, we have:
- Observation

Num | Observation | Min | Max |
---|---|---|---|
0 | position | -1.2 | 0.6 |
1 | velocity | -0.07 | 0.07 |
- The range of the (shaped) reward is -1 to +1
- It encourages the car to move with higher velocity and to get closer to the right edge (see the illustrative sketch below)
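The exact shaping lives in this repo's code; a minimal sketch consistent with the description above (formula and constants are illustrative only):

```python
def shaped_reward(position, velocity):
    # Illustrative shaping only; the actual formula in this repo may differ.
    # Higher speed and being closer to the right edge (goal at 0.5) give more reward.
    speed_term = abs(velocity) / 0.07        # in [0, 1]
    position_term = (position + 1.2) / 1.8   # in [0, 1], 1.0 at the right edge (0.6)
    return speed_term + position_term - 1.0  # roughly in [-1, +1]
```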
- Prioritized experience replay has 2 issues
- Loss of diversity:
- With stochastic prioritization, every transition is sampled with some non-zero probability, which alleviates this problem.
- Introduces bias:
- The algorithm uses importance-sampling weights to prevent a few high-priority transitions from dominating the updates (see the sketch after this list).
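A minimal sketch of both fixes, assuming the standard PER formulation (α and β are the usual exponents; the names in src/rl.py may differ):

```python
import numpy as np

def sampling_probabilities(priorities, alpha=0.6):
    # Stochastic prioritization: P(i) = p_i**alpha / sum_k p_k**alpha,
    # so every transition keeps a non-zero chance of being sampled.
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    return scaled / scaled.sum()

def importance_weights(probs, n_transitions, beta=0.4):
    # Importance-sampling weights w_i = (N * P(i))**(-beta), normalized by the
    # maximum weight so they only scale the loss down (this corrects the bias).
    weights = (n_transitions * np.asarray(probs, dtype=np.float64)) ** (-beta)
    return weights / weights.max()
```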
- In `src/rl.py`, we have 3 classes: `SumTree`, `Memory`, and `RL`
- `SumTree`
- We store transitions and their priorities in this class.
- It supports non-uniform sampling (a minimal sketch follows).
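A minimal sum-tree sketch (attribute and method names are illustrative; the class in src/rl.py may differ):

```python
import numpy as np

class SumTree:
    """Minimal sum-tree: internal nodes store the sum of their children's
    priorities, leaves store the priorities of individual transitions.
    Sampling a value in [0, total) walks down the tree in O(log n)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)        # priorities (internal nodes + leaves)
        self.data = np.empty(capacity, dtype=object)  # stored transitions
        self.write = 0                                # index of the next leaf to overwrite

    def add(self, priority, transition):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = transition
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity  # overwrite oldest when full

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                               # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        idx = 0
        while idx < self.capacity - 1:                 # descend until a leaf is reached
            left, right = 2 * idx + 1, 2 * idx + 2
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = right
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    @property
    def total(self):
        return self.tree[0]                            # root holds the total priority
```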
- `Memory`
- This is the replay memory; it contains a sum-tree.
- `Memory` calculates priorities for `SumTree` and importance-sampling weights for `RL`, and maintains the priority of each transition (see the sketch below).
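A sketch of such a replay memory built on the SumTree sketch above (the hyperparameter names alpha, beta, and eps are the standard PER ones and may not match src/rl.py exactly):

```python
import numpy as np

class Memory:
    """Proportional prioritized replay built on the SumTree sketch above."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=0.01):
        self.tree = SumTree(capacity)
        self.alpha, self.beta, self.eps = alpha, beta, eps
        self.size = 0
        self.max_priority = 1.0

    def store(self, transition):
        # New transitions get maximal priority so each is seen at least once.
        self.tree.add(self.max_priority, transition)
        self.size = min(self.size + 1, self.tree.capacity)

    def sample(self, batch_size):
        # Split the total priority mass into equal segments, one sample per segment.
        segment = self.tree.total / batch_size
        leaves, priorities, batch = [], [], []
        for i in range(batch_size):
            value = np.random.uniform(segment * i, segment * (i + 1))
            leaf, priority, transition = self.tree.get(value)
            leaves.append(leaf); priorities.append(priority); batch.append(transition)
        probs = np.array(priorities) / self.tree.total
        weights = (self.size * probs) ** (-self.beta)   # importance-sampling weights
        weights /= weights.max()                        # normalize so weights only shrink the loss
        return leaves, batch, weights

    def update_priorities(self, leaves, td_errors):
        for leaf, err in zip(leaves, td_errors):
            priority = (abs(err) + self.eps) ** self.alpha
            self.max_priority = max(self.max_priority, priority)
            self.tree.update(leaf, priority)
```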
- `RL`
- This is the main part of the agent.
- Choosing actions via `actor()`: it uses the ε-greedy method, i.e., the agent picks a random action with probability ε. As the number of updates increases, ε becomes smaller.
- Learning via `learn()`: the agent updates itself with transitions sampled from the replay memory, and the loss has to be multiplied by the importance-sampling weights (a sketch of both methods follows this list).
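A sketch of the agent's ε-greedy action selection and its importance-sampling-weighted double-DQN update (`predict_q` and `target_q` stand in for the two networks as callables returning Q-values per action; the real names and training framework in src/rl.py may differ):

```python
import numpy as np

class RL:
    """Sketch of epsilon-greedy acting and the weighted double-DQN update."""

    def __init__(self, n_actions, memory, predict_q, target_q,
                 gamma=0.99, eps_start=0.5, eps_end=0.1, eps_decay=1e-4):
        self.n_actions, self.memory = n_actions, memory
        self.predict_q, self.target_q = predict_q, target_q
        self.gamma = gamma
        self.epsilon, self.eps_end, self.eps_decay = eps_start, eps_end, eps_decay

    def actor(self, state):
        # Epsilon-greedy: random action with probability epsilon, otherwise greedy.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.predict_q(state)))

    def learn(self, batch_size=32):
        leaves, batch, weights = self.memory.sample(batch_size)
        td_errors, losses = [], []
        for (s, a, r, s_next, done), w in zip(batch, weights):
            # Double DQN: the predict network picks the next action,
            # the target network evaluates it.
            a_next = int(np.argmax(self.predict_q(s_next)))
            y = r if done else r + self.gamma * self.target_q(s_next)[a_next]
            td = y - self.predict_q(s)[a]
            td_errors.append(td)
            losses.append(w * td ** 2)   # IS weight scales this transition's squared error
        loss = np.mean(losses)           # the gradient step on `loss` lives in src/rl.py
        self.memory.update_priorities(leaves, td_errors)
        self.epsilon = max(self.eps_end, self.epsilon - self.eps_decay)
        return loss
```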
- Epsilon: 0.5 to 0.1. This value decides whether actions are chosen greedily or randomly. We set it to 0.5 at the beginning so that the agent explores the environment. As training progresses the agent becomes smarter, so it no longer needs to explore with high probability.
- Replay memory size: 10,000
- Parameters of the target network update: in the DQN agent we have 2 neural networks, a target network and a predict network. In the double DQN target formula y = r + γ · Q_target(s′, argmax_a′ Q_predict(s′, a′)), the target network is Q_target and the predict network is Q_predict. We use a hard update with a period of 500, meaning the target network is updated every time the predict network has been trained 500 times (see the sketch below).
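Illustrative schedules for the values above (the exact code in this repo may differ; the linear decay length is an assumption):

```python
# Epsilon schedule (0.5 -> 0.1) and hard target-network update every 500 training steps.
EPS_START, EPS_END = 0.5, 0.1
TARGET_UPDATE_PERIOD = 500

def epsilon(update_count, decay_steps=10000):
    # Linear decay from 0.5 to 0.1; `decay_steps` is an assumed value.
    frac = min(update_count / decay_steps, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def should_sync_target(update_count):
    # Hard update: copy the predict network's weights into the target network
    # every 500 training steps.
    return update_count > 0 and update_count % TARGET_UPDATE_PERIOD == 0
```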
- note 1: The y-axis shows the original reward per episode; it is not the reward used to update the agent.
- Original reward: -1 for each time step, until the goal position of 0.5 is reached.
- note 2: Both the red and the blue line use a double deep Q network.
- note 3: Both use a replay memory size of 10,000
- In this experiment, the replay memory is pre-filled with the same 10,000 transitions, collected by choosing random actions at the beginning (a pre-fill sketch follows).
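A sketch of such a pre-fill, using the `Memory` sketch above (main.py may wire this differently):

```python
import gym

env = gym.make("MountainCar-v0")
memory = Memory(capacity=10000)                      # Memory sketch from above

# Collect 10,000 transitions with purely random actions before training starts.
state = env.reset()
for _ in range(10000):
    action = env.action_space.sample()               # random action
    next_state, reward, done, _ = env.step(action)   # 4-tuple API of gym versions of that era
    memory.store((state, action, reward, next_state, done))
    state = env.reset() if done else next_state
```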
- batch size: 16, learning rate: 0.0005
- batch size: 32, learning rate: 0.0005
- batch size: 64, learning rate: 0.0005
- As the batch size increases, double DQN without prioritized experience replay becomes more stable (lower standard deviation).
- However, double DQN with prioritized experience replay remains stable throughout.
- batch size: 32, learning rate: 0.0005
- batch size: 32, learning rate: 0.001
- batch size: 32, learning rate: 0.01
- Although the uniform-sampling agent sometimes shows a higher standard deviation, overall, as the learning rate increases, the average reward of both agents gets higher.
- This diagram tells us the agent has learned to push right when the velocity is positive and to push left when it is negative. However, on the right-hand side the blue (push right) area is larger than the red (push left) area, because the agent knows the goal is not far away.
- When we change the batch size
- As the batch size increases, double DQN without prioritized experience replay becomes more stable (lower standard deviation).
- However, double DQN with prioritized experience replay remains stable throughout.
- When we change the learning rate
- Although the uniform-sampling agent sometimes shows a higher standard deviation, overall, as the learning rate increases, the average reward of both agents gets higher.
- Prioritized experience replay is better than uniform sampling
- More robust:
- In a prioritized experience replay agent, new transitions arrive without a known TD error, so we insert them with maximal priority to guarantee that all experience is seen at least once. The uniform-sampling agent is not so lucky: some important transitions may leave the replay memory without ever being used for an update. Therefore its performance sometimes shows a higher standard deviation.
- Better score at the beginning:
- The replay memory mostly contains redundant transitions. Prioritized experience replay makes rare and task-relevant transitions easier to sample, whereas uniform sampling mostly draws redundant transitions, so the uniform-sampling agent improves more slowly. In other words, the prioritized experience replay agent adapts to the environment earlier than the uniform-sampling one.
- DQN is value-based:
- The DQN agent always chooses actions via its action-value function (a neural network), which tells it which action is best when it needs to act. In addition, the agent uses the ε-greedy method, so it explores the environment (chooses a random action) with probability ε.
- This algorithm is off-policy:
- An on-policy agent updates itself based on actions derived from its current policy, whereas an off-policy agent updates itself based on actions obtained from other policies. In this algorithm, the agent samples transitions from the replay memory, but the replay memory contains transitions generated by many different (mostly older) policies, almost none of which are the current policy. So this algorithm is off-policy.
- The replay memory breaks temporal correlations:
- A deep Q network agent randomly samples transitions from the replay memory, so old and new transitions are mixed and the temporal correlations are broken. In addition, replay memory allows rare experiences to be used for more than just a single update.