Unit 3 proposal updates #481

Open · wants to merge 1 commit into base: main
6 changes: 3 additions & 3 deletions units/en/unit3/deep-q-algorithm.mdx
@@ -94,11 +94,11 @@ We face a simple problem by calculating the TD target: how are we sure that **t

We know that the accuracy of Q-values depends on what action we tried **and** what neighboring states we explored.

-Consequently, we don’t have enough information about the best action to take at the beginning of the training. Therefore, taking the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q value than the optimal best action, the learning will be complicated.**
+Consequently, we don’t have enough information about the best action to take at the beginning of the training. Therefore, taking the action with the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q value than the optimal best action, the learning will be complicated.**

The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q-value generation. We:
-- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q-value).
-- Use our **Target network** to calculate the target Q-value of taking that action at the next state.
+- Use our **DQN network** to select the best action to take at the current state (the action with the highest Q-value).
+- Use our **Target network** to calculate the target Q-value of this _(state, action)_ pair: the sum of the immediate reward and an estimate of the Q-value of the next state, given by this target network.

Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and with more stable learning.
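
For readers comparing the two wordings above, here is a minimal PyTorch-style sketch of the two-network target computation, following the Double DQN formulation of van Hasselt et al.; `dqn_net`, `target_net`, and the tensor names are illustrative assumptions, not code from the course notebooks.

```python
import torch

def double_dqn_target(dqn_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute the Double DQN TD target for a batch of transitions (sketch)."""
    with torch.no_grad():
        # Action selection: the online DQN network picks the argmax action
        # for each next state.
        next_actions = dqn_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation: the target network estimates the Q-value of the
        # selected action, decoupling selection from evaluation.
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # TD target: immediate reward plus the discounted estimate, zeroed out
        # for terminal transitions.
        return rewards + gamma * (1.0 - dones) * next_q
```

If a single network were used for both roles, the same noisy overestimates would be selected and then propagated into the target; splitting selection from evaluation is what dampens the overestimation described above.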

4 changes: 2 additions & 2 deletions units/en/unit3/glossary.mdx
@@ -27,8 +27,8 @@ In order to obtain temporal information, we need to **stack** a number of frames
our Deep Q-Network after certain **C steps**.

- **Double DQN:** Method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Value generation**:
-- **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**)
-- **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
+- **DQN Network** to select the best actions to take during sampling (i.e., the actions with the highest **Q-Value**) and to estimate the values of these actions during training.
+- **Target Network** to calculate the target **Q-Value** of the selected _(state, action)_ pairs.
This approach reduces the overestimation of **Q-Values** and helps us train faster and with more stable learning.
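
As a quick illustration of the two networks in this glossary (the online DQN and the Target Network that is refreshed after **C steps**), here is a minimal sketch of the hard update; the helper name and the commented usage are assumptions, not code from the course notebooks.

```python
import torch

def sync_target_network(dqn_net: torch.nn.Module, target_net: torch.nn.Module) -> None:
    """Hard update: copy the online DQN weights into the target network."""
    target_net.load_state_dict(dqn_net.state_dict())

# Hypothetical usage inside a training loop, where C is the update period:
# if step % C == 0:
#     sync_target_network(dqn_net, target_net)
```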

If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
4 changes: 2 additions & 2 deletions units/en/unit3/quiz.mdx
@@ -94,9 +94,9 @@ But, with experience replay, **we create a replay buffer that saves experience s

When we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:

-- Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q value).
+- Use our *DQN network* to **select the best action to take at the current state** (the action with the highest Q value).

-- Use our *Target network* to calculate **the target Q value of taking that action at the next state**.
+- Use our *Target network* to calculate **the target Q value of this _(state, action)_ pair**: the sum of the immediate reward and an estimate of the Q-value of the next state, given by this target network.

</details>
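
To make the decoupling in the answer above concrete, the vanilla DQN target and the Double DQN target (van Hasselt et al., 2016) can be written as follows, with Q_θ the online DQN and Q_θ⁻ the target network; this is the standard notation from the literature, not a formula taken from the quiz itself.

```latex
% Vanilla DQN target: the target network both selects and evaluates the action.
y_t = r_{t+1} + \gamma \max_{a} Q_{\theta^{-}}(s_{t+1}, a)

% Double DQN target: the online network selects the action,
% the target network evaluates it.
y_t = r_{t+1} + \gamma \, Q_{\theta^{-}}\big(s_{t+1}, \arg\max_{a} Q_{\theta}(s_{t+1}, a)\big)
```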
