[PUBLISHER] Merge #49
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 10.md

* PUSH ATTACHMENT : Pasted image 20241020163624.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md

* PUSH ATTACHMENT : Pasted image 20241020160432.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 11.md

* PUSH ATTACHMENT : Pasted image 20241020202242.png
dgcnz authored Oct 20, 2024
1 parent 17ca25b commit 77d3236
Showing 6 changed files with 239 additions and 11 deletions.
@@ -0,0 +1,30 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
- rl1
url:
share: true
---
# 10 On-Policy Control with Approximation

Now that we know how to learn state-value functions, we can tackle the control problem by learning action-value functions instead and acting $\epsilon$-greedily with respect to them.

## 10.1 Episodic Semi-gradient Control

> [!NOTE] Equation 10.1: General gradient-descent update for action-value prediction
>
> $$
> \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[U_t - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \tag{10.1}
> $$
> [!NOTE] Equation 10.2: Episodic semi-gradient one-step SARSA
>
> $$
> \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \tag{10.2}
> $$
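
A minimal sketch of how the update in Eqs. 10.1/10.2 looks in code, assuming linear features so that $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$ and $\nabla \hat{q} = \mathbf{x}(s, a)$; the feature map `x` and all names here are illustrative assumptions, not from the book.

```python
import numpy as np

def semi_gradient_sarsa_step(w, x, s, a, r, s_next, a_next, alpha, gamma, done=False):
    """One episodic semi-gradient one-step Sarsa update (Eq. 10.2) with linear q-hat."""
    q = w @ x(s, a)                                   # q_hat(S_t, A_t, w) = w^T x(S_t, A_t)
    q_next = 0.0 if done else w @ x(s_next, a_next)   # the value of a terminal state is 0
    td_error = r + gamma * q_next - q
    return w + alpha * td_error * x(s, a)             # gradient of a linear q-hat is x(s, a)

def epsilon_greedy(w, x, s, actions, eps, rng):
    """Epsilon-greedy action selection with respect to the approximate q-values."""
    if rng.random() < eps:
        return actions[rng.integers(len(actions))]
    q_values = [w @ x(s, a) for a in actions]
    return actions[int(np.argmax(q_values))]
```
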
![[Pasted image 20241020163624.png|700]]
@@ -9,6 +9,16 @@ tags:
url:
share: true
---
# 11 Off-policy Methods with Approximation

The off-policy case is a bit tricky: previously, in [[Reinforcement Learning - An Introduction - Chapter 5|Reinforcement Learning - An Introduction - Chapter 5]], we adjusted our target $G_t$ with the importance sampling ratio so that its expectation was $v_{\pi}$ rather than $v_b$.

However, for semi-gradient methods with function approximation another factor comes in: by following the behavior policy, the **distribution of updates** is that of $b$ rather than $\pi$, so it is biased as well!

There are two general approaches:
1. **Importance sampling** - warps the distribution of updates back towards the on-policy distribution
2. **True gradient methods** - do not rely on any special distribution for stability

## 11.1 Semi-gradient Methods

> [!NOTE] Equation 11.1: Per-step importance sampling ratio
@@ -17,15 +27,54 @@ share: true
> \rho_t \doteq \rho_{t:T-1} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}
> $$
> [!FAQ]- What does a high/low importance sampling ratio imply about the action taken?
> - High: the action is much more likely under the target policy than under the behavior policy (so $b$ takes it rarely relative to $\pi$)
> - Low: the action is much less likely under the target policy than under the behavior policy
> [!NOTE] Equation 11.2: Update rule for semi-gradient off-policy TD(0)
>
> $$
> \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t, \mathbf{w}_t) \tag{11.2}
> $$
> [!NOTE] Equation 11.3: Definition of $\delta_t$ for semi-gradient off-policy TD(0)
>
> $$
> \delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) \tag{11.3}
> $$
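
A minimal sketch of Eqs. 11.1–11.3 with a linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$; the feature map `x` and the policy-probability callables `pi_prob` and `b_prob` are assumptions for illustration.

```python
def off_policy_td0_step(w, x, s, a, r, s_next, alpha, gamma, pi_prob, b_prob, done=False):
    """Semi-gradient off-policy TD(0) update (Eqs. 11.1-11.3) with linear v-hat."""
    rho = pi_prob(a, s) / b_prob(a, s)        # per-step importance sampling ratio (11.1)
    v_next = 0.0 if done else w @ x(s_next)
    delta = r + gamma * v_next - w @ x(s)     # TD error (11.3)
    return w + alpha * rho * delta * x(s)     # update (11.2); gradient of linear v-hat is x(s)
```
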
> [!NOTE] Equation 11.5: Update rule for semi-gradient Expected Sarsa
>
> $$
> \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \nabla \hat{q}(S_t, A_t, \mathbf{w}_t) \tag{11.5}
> $$
> [!NOTE] Definition of $\delta_t$ for semi-gradient Expected Sarsa (episodic case)
>
> $$
> \delta_t \doteq R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) \hat{q}(S_{t+1}, a, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)
> $$
Note that, unlike the TD(0) update above, this one-step action-value algorithm does not use importance sampling: the expectation over the next action is taken explicitly under $\pi$.
Also:
- eq 11.6: n-step version of semi-gradient Sarsa
- eq 11.7: semi-gradient version of n-step tree-backup algorithm

## 11.2 Examples of Off-policy Divergence

Baird's counterexample: shows that even the simplest bootstrapping (DP, TD) combined with function approximation can diverge if updates are not done under the on-policy distribution.

> Another way to try to prevent instability is to use special methods for function approximation. In particular, stability is guaranteed for function approximation methods that do not extrapolate from the observed targets. These methods, called averagers, include nearest neighbor methods and locally weighted regression, but not popular methods such as tile coding and artificial neural networks (ANNs).
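
For intuition, here is a small simulation sketch of Baird's counterexample under semi-gradient off-policy TD(0). The constants below (seven states, eight weights, features $2\mathbf{e}_i + \mathbf{e}_8$ for the upper states and $\mathbf{e}_7 + 2\mathbf{e}_8$ for the lower one, $\gamma = 0.99$, zero rewards, dashed/solid behavior probabilities $6/7$ and $1/7$, initial weights with $w_7 = 10$) are my reconstruction of the standard setup, so treat them as assumptions.

```python
import numpy as np

def bairds_counterexample(steps=1000, alpha=0.01, gamma=0.99, seed=0):
    """Semi-gradient off-policy TD(0) on (an assumed version of) Baird's counterexample."""
    rng = np.random.default_rng(seed)
    # Features (0-indexed): upper states i = 0..5 -> 2*e_i + e_7; lower state 6 -> e_6 + 2*e_7.
    X = np.zeros((7, 8))
    for i in range(6):
        X[i, i], X[i, 7] = 2.0, 1.0
    X[6, 6], X[6, 7] = 1.0, 2.0
    w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])
    s = rng.integers(7)
    norms = []
    for _ in range(steps):
        # Behavior b: 'dashed' w.p. 6/7 (-> random upper state), 'solid' w.p. 1/7 (-> lower state).
        # Target pi: always 'solid'.
        if rng.random() < 6 / 7:
            solid, s_next = False, rng.integers(6)
        else:
            solid, s_next = True, 6
        rho = 7.0 if solid else 0.0                       # pi(a|s) / b(a|s)
        delta = 0.0 + gamma * (w @ X[s_next]) - w @ X[s]  # all rewards are zero
        w = w + alpha * rho * delta * X[s]                # semi-gradient off-policy TD(0) (11.2)
        norms.append(float(np.linalg.norm(w)))
        s = s_next
    return w, norms

if __name__ == "__main__":
    w, norms = bairds_counterexample()
    print(norms[0], norms[-1])   # the weight norm keeps growing instead of converging
```
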
## 11.3 The Deadly Triad

Instability arises when we combine the following 3 elements:
1. Function approximation
2. Bootstrapping
3. Off-policy training

## 11.4 Linear Value-function Geometry

TODO:
- [x] 11.11 mu norm equation ✅ 2024-10-01
- [x] 11.17 and 11.18 bellman error ✅ 2024-10-01
- [x] 11.19 mean square bellman error ✅ 2024-10-01
TLDR: Using the geometry of value-function space, we see that $\overline{BE}$ measures how far $v_{\mathbf{w}}$ is from satisfying the Bellman equation; it is zero only at $v_\pi$, which is generally unreachable with function approximation.

> [!NOTE] Equation 11.11: $\mu$-norm
>
@@ -50,6 +99,10 @@
## 11.5 Gradient Descent in the Bellman Error

TLDR: Semi-gradient updates might diverge, but true SGD doesn't! Sadly, both the $\overline{TDE}$ and the $\overline{BE}$ objectives lead to undesirable solutions.

Let's first take an easier case, minimizing $\overline{TDE}$.

> [!NOTE] Mean-squared temporal difference error
>
> $$
@@ -60,7 +113,9 @@ TODO:
> \end{align}
> $$
This yields the following.

> [!NOTE] Equation 11.23: Weight update of naive residual-gradient algorithm
>
> $$
> \begin{align}
@@ -70,3 +125,79 @@ TODO:
> \end{align}
> $$
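
A minimal sketch of the naive residual-gradient update (Eq. 11.23) with linear features; note that, unlike the semi-gradient update, the gradient of the bootstrapped next value is kept. The feature map `x` is an assumption.

```python
import numpy as np

def naive_residual_gradient_step(w, x, s, r, s_next, alpha, gamma, rho=1.0, done=False):
    """Naive residual-gradient update (Eq. 11.23): true SGD on the squared TD error."""
    x_s = x(s)
    x_next = np.zeros_like(x_s) if done else x(s_next)
    delta = r + gamma * (w @ x_next) - w @ x_s
    # Both gradients are kept: grad v-hat(S_t) = x_s and grad v-hat(S_{t+1}) = x_next.
    return w + alpha * rho * delta * (x_s - gamma * x_next)
```
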
Conclusion:
> Minimizing the TDE is naive; by penalizing all TD errors it achieves something more like temporal smoothing than accurate prediction.
> Although the naive residual-gradient algorithm converges robustly, it does not necessarily converge to a desirable place.
Doing the same for $\overline{BE}$ yields the residual-gradient algorithm, but it is not a good choice either: it is slow, it needs two independent samples of the next state to be a true (unbiased) gradient method, and even then it can converge to undesirable values under function approximation.

## 11.6 The Bellman Error is Not Learnable

TLDR: $\overline{BE}$ is not learnable but $\overline{TDE}$ and $\overline{PBE}$ are. Since minimizing $\overline{TDE}$ is naive, next section: minimize $\overline{PBE}$.

> Here we use the term in a more basic way, to mean learnable at all, with any amount of experience. It turns out many quantities of apparent interest in reinforcement learning cannot be learned even from an infinite amount of experiential data. These quantities are well defined and can be computed given knowledge of the internal structure of the environment, but cannot be computed or estimated from the observed sequence of feature vectors, actions, and rewards
## 11.7 Gradient-TD Methods

TLDR: To minimize $\overline{PBE}$ using SGD efficiently we use two separate estimates for dependent expectations. This yields two algorithms: GTD2 and TDC.

> [!NOTE] Equation 11.27: Gradient of $\overline{PBE}$
>
> $$
> \nabla \overline{PBE}(\mathbf{w}) = 2 \mathbb{E} \left[ \rho_t (\gamma \mathbf{x}_{t+1} - \mathbf{x}_t) \mathbf{x}_t^\top \right] \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right]
> $$
The first and last expectations are not independent (both depend on the same next feature vector), so sampling them from the same transition would bias the product; instead, we keep a separate estimate of the product of the last two factors.

> [!NOTE] Equation 11.28: Definition of $\mathbf{v}$
>
> Grouping the last two factors of the gradient of $\overline{PBE}$, we get a vector $\mathbf{v}\in \mathbb{R}^d$ which we can estimate and store efficiently:
>
> $$
> \mathbf{v} \approx \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right]
> $$
> [!NOTE] Update rule for $\mathbf{v}$
>
> $$
> \mathbf{v}_{t+1} = \mathbf{v}_t + \beta \rho_t (\delta_t - \mathbf{v}_t^\top \mathbf{x}_t) \mathbf{x}_t
> $$
>
> Where:
> - $\beta$ is a learning rate
Using $\mathbf{v}$, we can now update $\mathbf{w}$, which yields the GTD2 algorithm:

> [!NOTE] Equation 11.29: Update rule for GTD2
>
> $$
> \begin{align}
> \mathbf{w}_{t+1} &= \mathbf{w}_t - \frac{1}{2}\alpha \nabla \overline{PBE}(\mathbf{w}_t) \quad \tag{the general SGD rule} \\
> &= \mathbf{w}_t - \frac{1}{2}\alpha 2 \mathbb{E} \left[ \rho_t (\gamma \mathbf{x}_{t+1} - \mathbf{x}_t) \mathbf{x}_t^\top \right] \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right] \quad
> \tag{from (11.27)} \\
> &= \mathbf{w}_t + \alpha \mathbb{E} \left[ \rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1}) \mathbf{x}_t^\top \right] \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right] \quad \tag{11.29} \\
> &\approx \mathbf{w}_t + \alpha \mathbb{E} \left[ \rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1}) \mathbf{x}_t^\top \right] \mathbf{v}_t \quad \tag{based on (11.28)} \\
> &\approx \mathbf{w}_t + \alpha \rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1}) \mathbf{x}_t^\top \mathbf{v}_t \quad \tag{sampling}
> \end{align}
> $$
> [!FAQ]- What is the complexity of GTD2?
> $O(d)$ per step, provided the scalar $\mathbf{x}_t^\top \mathbf{v}_t$ is computed first
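
A minimal sketch of the sampled GTD2 step (the last line of Eq. 11.29 together with the $\mathbf{v}$ update above), assuming linear features and a per-step $\rho_t$ computed elsewhere; names are assumptions. Computing the scalar $\mathbf{x}_t^\top \mathbf{v}_t$ first keeps the cost at $O(d)$ per step.

```python
def gtd2_step(w, v, x_s, x_next, r, rho, alpha, beta, gamma):
    """One sampled GTD2 step; v tracks E[x x^T]^{-1} E[rho delta x] (Eq. 11.28)."""
    delta = r + gamma * (w @ x_next) - w @ x_s           # TD error under the current w
    v_x = v @ x_s                                        # scalar x_t^T v_t, computed first -> O(d)
    w = w + alpha * rho * (x_s - gamma * x_next) * v_x   # main weights (11.29, sampled form)
    v = v + beta * rho * (delta - v_x) * x_s             # secondary weights
    return w, v
```
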
With a bit more algebra we arrive at a better algorithm: TDC (TD(0) with gradient correction), also known as GTD(0).

> [!NOTE] Equation: Update rule for TDC
>
> $$
> \begin{align}
> \mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha \mathbb{E} \left[ \rho_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1}) \mathbf{x}_t^\top \right] \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right] \\
> &= \mathbf{w}_t + \alpha \left( \mathbb{E} \left[ \rho_t \mathbf{x}_t \mathbf{x}_t^\top \right] - \gamma \mathbb{E} \left[ \rho_t \mathbf{x}_{t+1} \mathbf{x}_t^\top \right] \right) \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right] \\
> &= \mathbf{w}_t + \alpha \left( \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right] - \gamma \mathbb{E} \left[ \rho_t \mathbf{x}_{t+1} \mathbf{x}_t^\top \right] \right) \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right] \\
> &= \mathbf{w}_t + \alpha \left( \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right] - \gamma \mathbb{E} \left[ \rho_t \mathbf{x}_{t+1} \mathbf{x}_t^\top \right] \mathbb{E} \left[ \mathbf{x}_t \mathbf{x}_t^\top \right]^{-1} \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right] \right) \\
> &\approx \mathbf{w}_t + \alpha \left( \mathbb{E} \left[ \rho_t \delta_t \mathbf{x}_t \right] - \gamma \mathbb{E} \left[ \rho_t \mathbf{x}_{t+1} \mathbf{x}_t^\top \right] \mathbf{v}_t \right) \quad &\text{(based on (11.28))} \\
> &\approx \mathbf{w}_t + \alpha \rho_t \left( \delta_t \mathbf{x}_t - \gamma \mathbf{x}_{t+1} \mathbf{x}_t^\top \mathbf{v}_t \right), \quad &\text{(sampling)}
> \end{align}
> $$
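
The corresponding sampled TDC step (the last line above), sharing the same $\mathbf{v}$ update as GTD2; as before, this is a sketch under the same assumptions.

```python
def tdc_step(w, v, x_s, x_next, r, rho, alpha, beta, gamma):
    """One sampled TDC / GTD(0) step: a TD(0) term plus a gradient-correction term."""
    delta = r + gamma * (w @ x_next) - w @ x_s
    v_x = v @ x_s                                                # scalar correction factor
    w = w + alpha * rho * (delta * x_s - gamma * x_next * v_x)   # TD term + correction term
    v = v + beta * rho * (delta - v_x) * x_s                     # same secondary update as GTD2
    return w, v
```
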
Summary from the lectures:

![[Pasted image 20241020202242.png|700]]
@@ -38,7 +38,7 @@ New notation! ($s\to u$ is an update rule for $v(s)$ using new expression $u$)
>
> Without these two, we must say which states are most important to us.
> [!NOTE] Equation 9.1: Value Error
>
> $$
> \begin{align}
@@ -100,7 +100,7 @@ Where:
- $\mathbb{E}[U_t \mid S_t=s] = v_\pi(s)$
- With local optimum convergence guarantees.

![[Pasted image 20240923171752.png|800]]

Examples of $U_t$:
- Monte Carlo target: $U_t = G_t$ (that is, the return accumulated until the end of the episode); unbiased (see the sketch below).
@@ -111,7 +111,7 @@ Examples of $U_t$:
- Do not converge as robustly as gradient methods, aside from the linear case.
- Faster, enable online/continual learning.
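
Going back to the Monte Carlo target above, here is a minimal sketch of gradient Monte Carlo prediction with linear features, i.e. the general update with $U_t = G_t$; the episode format and the feature map `x` are assumptions.

```python
def gradient_mc_update(w, x, episode, alpha, gamma):
    """Gradient Monte Carlo for v-hat: SGD on [G_t - v_hat(S_t, w)]^2 with U_t = G_t."""
    # episode = [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)], collected under pi.
    # Iterating backwards lets us accumulate the returns G_t incrementally.
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G                        # return from state s
        w = w + alpha * (G - w @ x(s)) * x(s)    # gradient of a linear v-hat is x(s)
    return w
```
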

![[Pasted image 20240923172823.png|800]]

## 9.4 Linear Methods

@@ -127,7 +127,31 @@
> - $\mathbf{x}(s) = \left(x_1(s), \dots, x_d(s)\right)^\intercal$
- The gradient Monte Carlo algorithm converges to the global optimum of the VE under linear function approximation if $\alpha$ is reduced over time according to the usual conditions.
- The chapter also explores the convergence of TD(0) with SGD and linear approximation and finds that it converges to the *TD fixed point* (Eqs. 9.11, 9.12), $\mathbf{w}_{TD}$. This is not the global optimum of $\overline{VE}$, but it is within a bounded factor of it (Eq. 9.14).

> [!NOTE] Equation 9.11 and 9.12: TD fixed point
>
> Semi-gradient TD(0) under linear approximation converges to the *TD fixed point*:
>
> $$
> \mathbf{w}_{TD} = \mathbf{A}^{-1}\mathbf{b} \tag{9.12}
> $$
>
> Where:
> $$
> \mathbf{b} \doteq \mathbb{E} \left[ R_{t+1} \mathbf{x}_t \right] \in \mathbb{R}^d \quad \text{and} \quad \mathbf{A} \doteq \mathbb{E} \left[ \mathbf{x}_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\intercal \right] \in \mathbb{R}^{d \times d} \tag{9.11}
> $$
> [!FAQ]- Is $\mathbf{w}_{TD}$ the minimiser of $\overline{VE}$?
> No, but its error is bounded: $\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$ (Eq. 9.14).
> [!FAQ]- What are the convergence guarantees of semi-gradient TD(0) with non-linear features?
> No guarantees.
> [!FAQ]- When should you choose Semi-gradient TD(0) over Gradient Monte Carlo?
> Since Semi-gradient TD(0) learns faster, it is better when there is a fixed computational budget.
> However, if you can train for longer, Gradient Monte Carlo is better.
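
As a quick derivation sketch of why the fixed point has the form of Eq. 9.12: under linear $\hat{v}$, the expected semi-gradient TD(0) increment is $\alpha(\mathbf{b} - \mathbf{A}\mathbf{w})$, so it can only vanish at $\mathbf{w}_{TD} = \mathbf{A}^{-1}\mathbf{b}$.

$$
\begin{align}
\mathbb{E}\left[\Delta \mathbf{w}_t \mid \mathbf{w}\right]
&= \alpha\, \mathbb{E}\left[\left(R_{t+1} + \gamma \mathbf{w}^\intercal \mathbf{x}_{t+1} - \mathbf{w}^\intercal \mathbf{x}_t\right) \mathbf{x}_t\right] \\
&= \alpha \left(\mathbb{E}\left[R_{t+1} \mathbf{x}_t\right] - \mathbb{E}\left[\mathbf{x}_t \left(\mathbf{x}_t - \gamma \mathbf{x}_{t+1}\right)^\intercal\right] \mathbf{w}\right) \\
&= \alpha \left(\mathbf{b} - \mathbf{A} \mathbf{w}\right)
\end{align}
$$
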


> [!NOTE] Equation 9.14
@@ -141,7 +165,7 @@
> $$

![[Pasted image 20240923173826.png|800]]

> [!NOTE] Equation 9.15
>
@@ -179,3 +203,46 @@
> $$

## 9.7 Nonlinear Function Approximation: Artificial Neural Networks


TLDR: Using artificial neural networks for nonlinear function approximation. Discusses common techniques: architectures, dropout, batch normalization, etc.

## 9.8 Least-Squares TD

> [!NOTE] Equation 9.20 and 9.21: LSTD update
>
> $$
> \mathbf{w}_t \doteq \widehat{\mathbf{A}}_t^{-1}\widehat{\mathbf{b}}_t \tag{9.21}
> $$
>
> Where:
>
> $$
> \widehat{\mathbf{A}}_t \doteq \sum_{k=0}^{t-1} \mathbf{x}_k(\mathbf{x}_k - \gamma \mathbf{x}_{k+1})^\intercal + \epsilon \mathbf{I} \quad \text{and} \quad \widehat{\mathbf{b}}_t \doteq \sum_{k=0}^{t-1} \mathbf{x}_k R_{k+1} \tag{9.20}
> $$

> [!FAQ]- Shouldn't both sums be divided by $t$ like normal sample averages?
> Strictly, both are estimates of $t$ times the corresponding expectations, but the factors of $t$ cancel in the product $\widehat{\mathbf{A}}_t^{-1}\widehat{\mathbf{b}}_t$, so dividing is unnecessary.
> [!FAQ]- Why is there an $\epsilon \mathbf{I}$ term in the LSTD update?
> This is a regularization term to ensure the matrix is invertible.
> [!FAQ]- What is the computational complexity of LSTD?
> By naively computing the inverse, it is $O(d^3)$, but there are more efficient methods.
> [!FAQ]- How can we improve the computational complexity of LSTD?
> By using the Sherman-Morrison formula, we can reduce the complexity to $O(d^2)$: it lets us compute $\widehat{\mathbf{A}}_t^{-1}$ incrementally from $\widehat{\mathbf{A}}_{t-1}^{-1}$ in $O(d^2)$ time (see the sketch below).
> [!FAQ]- Does LSTD require specifying any hyperparameters (e.g. step size)?
> Yes, but not a step size. The only hyperparameter is $\epsilon$. It has a similar effect to a step size: if $\epsilon$ is too small, the sequence of inverses can vary wildly, and if $\epsilon$ is too large, learning is slow.
> [!FAQ]- How does LSTD compare to Semi-Gradient TD? Name 4 differences.
> 1. More sample efficient.
> 2. Requires more computation: $O(d^2)$ vs $O(d)$.
> 3. Does not require a step-size parameter, only $\epsilon$.
> 4. LSTD never forgets, which can be a problem when the target policy changes (as it does in control / GPI).
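
A minimal sketch of incremental LSTD with the Sherman-Morrison update of $\widehat{\mathbf{A}}_t^{-1}$, so that each step is $O(d^2)$ as in the FAQ above; the class and variable names are my own, not from the book.

```python
import numpy as np

class LSTD:
    """Incremental LSTD: maintains A-hat^{-1} and b-hat, returns w = A-hat^{-1} b-hat (Eq. 9.21)."""

    def __init__(self, d, epsilon=1.0):
        self.inv_A = np.eye(d) / epsilon    # A-hat_0 = eps * I, so its inverse is I / eps
        self.b = np.zeros(d)

    def update(self, x_s, r, x_next, gamma):
        # Rank-one increment to A-hat: u v^T with u = x_k and v = x_k - gamma * x_{k+1}.
        u = x_s
        v = x_s - gamma * x_next
        # Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u)
        Au = self.inv_A @ u
        vA = v @ self.inv_A
        self.inv_A -= np.outer(Au, vA) / (1.0 + vA @ u)
        self.b += r * x_s                   # b-hat accumulates x_k * R_{k+1}
        return self.inv_A @ self.b          # current weight vector w_t
```
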
![[Pasted image 20241020160432.png|800]]
- Note: Details not on exam
Binary file added docs/images/Pasted image 20241020160432.png
Binary file added docs/images/Pasted image 20241020163624.png
Binary file added docs/images/Pasted image 20241020202242.png
