generated from jobindjohn/obsidian-publish-mkdocs
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 10.md
* PUSH ATTACHMENT : Pasted image 20241020163624.png
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md
* PUSH ATTACHMENT : Pasted image 20241020160432.png
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 11.md
* PUSH ATTACHMENT : Pasted image 20241020202242.png
Showing 6 changed files with 239 additions and 11 deletions.
30 additions & 0 deletions: ...erence notes/104 Other/Reinforcement Learning - An Introduction - Chapter 10.md
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barto|Andrew G. Barto]]"
year: 2018
tags:
- textbook
- rl1
url:
share: true
---
# 10 On-Policy Control with Approximation

Now that we know how to learn value functions, we can tackle the control problem by learning action-value functions instead and using an $\epsilon$-greedy policy over them.

## 10.1 Episodic Semi-gradient Control

> [!NOTE] Equation 10.1: General gradient-descent update for action-value prediction
>
> $$
> \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \tag{10.1}
> $$

> [!NOTE] Equation 10.2: Episodic semi-gradient one-step SARSA
>
> $$
> \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \tag{10.2}
> $$

![[Pasted image 20241020163624.png|700]]
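The update in Equation 10.2 can be sketched in code. The following is a minimal illustration of episodic semi-gradient one-step SARSA with linear function approximation and an $\epsilon$-greedy policy; the small chain environment, the one-hot features, and all names (`step`, `train`, etc.) are illustrative assumptions, not from the book.

```python
import numpy as np

# Hypothetical toy setup: a 5-state chain; action 1 moves right, action 0
# moves left; reaching the rightmost state ends the episode with reward +1.
N_STATES, N_ACTIONS = 5, 2

def features(s, a):
    """One-hot feature vector x(s, a); for linear q_hat, grad_w q_hat = x."""
    x = np.zeros(N_STATES * N_ACTIONS)
    x[s * N_ACTIONS + a] = 1.0
    return x

def q_hat(s, a, w):
    """Linear action-value estimate q_hat(s, a, w) = w . x(s, a)."""
    return w @ features(s, a)

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def eps_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_hat(s, a, w) for a in range(N_ACTIONS)]))

def train(episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s, done = 0, False
        a = eps_greedy(s, w, eps, rng)
        while not done:
            s2, r, done = step(s, a)
            if done:
                target = r  # value of the terminal state is defined as 0
            else:
                a2 = eps_greedy(s2, w, eps, rng)
                target = r + gamma * q_hat(s2, a2, w)
            # Semi-gradient update (10.2); for linear q_hat the gradient is x(s, a)
            w += alpha * (target - q_hat(s, a, w)) * features(s, a)
            if not done:
                s, a = s2, a2
    return w
```

The update is "semi-gradient" because the target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)$ also depends on $\mathbf{w}$, but only the gradient of the estimate $\hat{q}(S_t, A_t, \mathbf{w}_t)$ is taken.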