Unit 3 proposal updates #481

Open · wants to merge 1 commit into base: main
6 changes: 3 additions & 3 deletions units/en/unit3/deep-q-algorithm.mdx
@@ -94,11 +94,11 @@ We face a simple problem by calculating the TD target: how are we sure that **t

We know that the accuracy of Q-values depends on what action we tried **and** what neighboring states we explored.

-Consequently, we don’t have enough information about the best action to take at the beginning of the training. Therefore, taking the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q value than the optimal best action, the learning will be complicated.**
+Consequently, we don’t have enough information about the best action to take at the beginning of the training. Therefore, taking the action with the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q value than the optimal best action, the learning will be complicated.**

The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q-value generation. We:
-- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q-value).
-- Use our **Target network** to calculate the target Q-value of taking that action at the next state.
+- Use our **DQN network** to select the best action to take at the current state (the action with the highest Q-value).
+- Use our **Target network** to calculate the target Q-value of this _(state, action)_ pair: the sum of the immediate reward and an estimate of the Q-value of the next state, given by this target network.

Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and with more stable learning.
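
For readers comparing the two wordings above, here is a minimal PyTorch-style sketch of the two-network target computation, following the Double DQN formulation of van Hasselt et al.; `dqn_net`, `target_net`, and the tensor names are illustrative assumptions, not code from the course notebooks.

```python
import torch

def double_dqn_target(dqn_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute the Double DQN TD target for a batch of transitions (sketch)."""
    with torch.no_grad():
        # Action selection: the online DQN network picks the argmax action
        # for each next state.
        next_actions = dqn_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation: the target network estimates the Q-value of the
        # selected action, decoupling selection from evaluation.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # TD target: immediate reward plus the discounted estimate, zeroed out
        # for terminal transitions.
        return rewards + gamma * (1.0 - dones) * next_q
```

If a single network were used for both roles, the same noisy overestimates would be selected and then propagated into the target; splitting selection from evaluation is what dampens the overestimation described above.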

4 changes: 2 additions & 2 deletions units/en/unit3/glossary.mdx
@@ -27,8 +27,8 @@ In order to obtain temporal information, we need to **stack** a number of frames
our Deep Q-Network after certain **C steps**.

- **Double DQN:** Method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Value generation**:
-- **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**)
-- **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
+- **DQN Network** to select the best actions to take during sampling (i.e., the actions with the highest **Q-Value**) and to estimate the values of these actions during training.
+- **Target Network** to calculate the target **Q-Value** of the selected _(state, action)_ pairs.
This approach reduces the overestimation of **Q-Values** and helps us train faster and with more stable learning.
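
As a quick illustration of the two networks in this glossary (the online DQN and the Target Network that is refreshed after **C steps**), here is a minimal sketch of the hard update; the helper name and the commented usage are assumptions, not code from the course notebooks.

```python
import torch

def sync_target_network(dqn_net: torch.nn.Module, target_net: torch.nn.Module) -> None:
    """Hard update: copy the online DQN weights into the target network."""
    target_net.load_state_dict(dqn_net.state_dict())

# Hypothetical usage inside a training loop, where C is the update period:
# if step % C == 0:
#     sync_target_network(dqn_net, target_net)
```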

If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
4 changes: 2 additions & 2 deletions units/en/unit3/quiz.mdx
@@ -94,9 +94,9 @@ But, with experience replay, **we create a replay buffer that saves experience s

When we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:

-- Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q value).
+- Use our *DQN network* to **select the best action to take at the current state** (the action with the highest Q value).

-- Use our *Target network* to calculate **the target Q value of taking that action at the next state**.
+- Use our *Target network* to calculate **the target Q value of this _(state, action)_ pair**: the sum of the immediate reward and an estimate of the Q-value of the next state, given by this target network.

</details>
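
To make the decoupling in the answer above concrete, the vanilla DQN target and the Double DQN target (van Hasselt et al., 2016) can be written as follows, with Q_θ the online DQN and Q_θ⁻ the target network; this is the standard notation from the literature, not a formula taken from the quiz itself.

```latex
% Vanilla DQN target: the target network both selects and evaluates the action.
y_t = r_{t+1} + \gamma \max_{a} Q_{\theta^{-}}(s_{t+1}, a)

% Double DQN target: the online network selects the action,
% the target network evaluates it.
y_t = r_{t+1} + \gamma \, Q_{\theta^{-}}\big(s_{t+1}, \arg\max_{a} Q_{\theta}(s_{t+1}, a)\big)
```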
