[PUBLISHER] Merge #40
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md

* PUSH ATTACHMENT : Pasted image 20240923171752.png

* PUSH ATTACHMENT : Pasted image 20240923172823.png

* PUSH ATTACHMENT : Pasted image 20240923173826.png
dgcnz authored Sep 23, 2024
1 parent 84598ba commit 4765516
Showing 4 changed files with 105 additions and 4 deletions.
New notation! ($s\to u$ is an update rule for $v(s)$ using new expression $u$)
>
> $$
> \begin{align}
> \overline{VE}(\mathbf{w}) &\doteq \sum_{s \in \mathcal{S}} \mu(s) \left[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\right]^2 && \tag{9.1}
> \end{align}
> $$
For on-policy episodic tasks, $\mu(s)$ is called the *on-policy distribution*, which can be computed as follows:
>
> $$
> \begin{align}
> \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})p(s \mid \bar{s}, a), && \text{for all } s \in S && \tag{9.2}
> \end{align}
> $$
>
> $$
> \begin{align}
> \mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')} && \tag{9.3}
> \end{align}
> $$
Where:
- Minimizing $\overline{VE}$ with gradient methods only guarantees local optimality.
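
As a sanity check of Eqs. 9.2 and 9.3, the on-policy distribution of a small episodic MDP can be obtained by solving the linear system $\eta = h + P_\pi^\top \eta$ and normalizing. The sketch below is a minimal illustration; the three-state chain, `h`, and `P_pi` are made-up example values, not anything from the chapter.

```python
import numpy as np

# Hypothetical 3-state episodic chain: episodes always start in state 0.
# P_pi[s_bar, s] = sum_a pi(a | s_bar) p(s | s_bar, a); rows may sum to < 1,
# with the missing probability mass going to an (omitted) terminal state.
h = np.array([1.0, 0.0, 0.0])             # h(s): probability an episode starts in s
P_pi = np.array([[0.0, 0.9, 0.1],
                 [0.0, 0.5, 0.4],
                 [0.0, 0.0, 0.0]])         # state 2 always terminates the episode

# Eq. 9.2 in matrix form: eta = h + P_pi^T eta  =>  (I - P_pi^T) eta = h
eta = np.linalg.solve(np.eye(3) - P_pi.T, h)

# Eq. 9.3: normalize expected visit counts into the on-policy distribution
mu = eta / eta.sum()
print(eta, mu)
```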


## 9.3 Stochastic-gradient and Semi-gradient Methods

> [!NOTE] Equations 9.4 and 9.5
>
> $$
> \begin{align}
> \mathbf{w}_{t+1} &= \mathbf{w}_t - \frac{1}{2} \alpha \nabla \left[v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right]^2 && \tag{9.4} \\
> &= \mathbf{w}_t + \alpha \left[v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t) && \tag{9.5}
> \end{align}
> $$
However, since we don't know the true $v_\pi(s)$, we can replace it with the *target output* $U_t$:

> [!NOTE] Equation 9.7
>
> $$
> \begin{align}
> \mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha \left[U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t) && \tag{9.7}
> \end{align}
> $$
Where:
- $U_t$ *should* be an unbiased estimate of $v_\pi(s)$, that is:
- $\mathbb{E}[U_t \mid S_t=s] = v_\pi(s)$
- Under this condition (and a suitably decreasing step size $\alpha$), $\mathbf{w}_t$ is guaranteed to converge to a local optimum.

![[Pasted image 20240923171752.png|Pasted image 20240923171752.png]]

Examples of $U_t$:
- Monte Carlo target: $U_t = G_t$ (the return, i.e., the discounted sum of rewards until the end of the episode); this target is unbiased (see the sketch after this list).
- Bootstrapping targets (e.g., the TD(0) target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$) are biased because they depend on the current $\mathbf{w}$ through the value estimates of successor states.
- In practice the $\mathbf{w}$-dependent part of the target is treated as a constant (the gradient flow through it is stopped), so the update includes only part of the true gradient. This yields *semi-gradient methods*.
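
A minimal sketch of the Monte Carlo case of Eq. 9.7 ($U_t = G_t$), assuming a linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\intercal \mathbf{x}(s)$ (so $\nabla \hat{v} = \mathbf{x}(s)$) and a hypothetical episode format of `(feature_vector, reward)` pairs; both choices are illustrative, not from the note.

```python
import numpy as np

def gradient_mc_update(w, episode, alpha, gamma):
    """One gradient Monte Carlo pass over a finished episode (Eq. 9.7 with U_t = G_t).

    `episode` is an assumed list of (x_t, R_{t+1}) pairs, where x_t is the
    feature vector of S_t; with the linear v_hat = w . x, grad v_hat = x.
    """
    G = 0.0
    for x, r in reversed(episode):        # compute returns backwards through time
        G = r + gamma * G                 # G_t = R_{t+1} + gamma * G_{t+1}
        w = w + alpha * (G - w @ x) * x   # Eq. 9.7: w += alpha [U_t - v_hat] grad v_hat
    return w
```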

*Semi-gradient methods*:
- Do not converge as robustly as true gradient methods, although they do converge reliably in important cases such as the linear case.
- Are typically faster to learn and enable online/continual learning (see the TD(0) sketch below).

![[Pasted image 20240923172823.png|Pasted image 20240923172823.png]]
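
A minimal semi-gradient TD(0) sketch, again assuming a linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\intercal \mathbf{x}(s)$; the `env_step(s)` and `x(s)` interfaces are assumptions made for this sketch. The bootstrapped target is treated as a constant: only $\nabla \hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$ enters the update.

```python
import numpy as np

def semi_gradient_td0(w, env_step, x, alpha, gamma, s0, n_steps):
    """Semi-gradient TD(0) with linear v_hat(s, w) = w . x(s).

    `env_step(s)` is an assumed interface returning (r, s_next, done);
    `x(s)` maps a state to its feature vector.
    """
    s = s0
    for _ in range(n_steps):
        r, s_next, done = env_step(s)
        target = r if done else r + gamma * (w @ x(s_next))  # treated as constant
        w = w + alpha * (target - w @ x(s)) * x(s)            # semi-gradient update
        s = s0 if done else s_next                            # restart finished episodes
    return w
```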

## 9.4 Linear Methods

> [!NOTE] Equation 9.8
>
> $$
> \begin{align}
> \hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^\intercal \mathbf{x}(s) = \sum_{i=1}^d w_i x_i(s) && \tag{9.8}
> \end{align}
> $$
>
> Where:
> - $\mathbf{x}(s) = \left(x_1(s), \dots, x_d(s)\right)^\intercal$
- The chapter also analyzes semi-gradient TD(0) with linear approximation and shows that it converges to the *TD fixed point* $\mathbf{w}_{TD}$ (Eqs. 9.11, 9.12).
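
Eqs. 9.11 and 9.12 are not reproduced in this note; as a sketch, the fixed point satisfies $A\mathbf{w}_{TD} = b$ with $A = \mathbb{E}\left[\mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\intercal\right]$ and $b = \mathbb{E}\left[R_{t+1}\mathbf{x}_t\right]$, so it can be estimated directly from sampled transitions (an LSTD-style computation). The transition list format below is an assumption for illustration.

```python
import numpy as np

def td_fixed_point(transitions, gamma, d):
    """Estimate w_TD from sampled transitions (x_t, r_{t+1}, x_{t+1}).

    Uses sample averages A ~= E[x_t (x_t - gamma x_{t+1})^T] and
    b ~= E[R_{t+1} x_t], then solves A w = b.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    A /= len(transitions)
    b /= len(transitions)
    return np.linalg.solve(A, b)
```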


> [!NOTE] Equation 9.14
>
> Interpretation: the asymptotic error of the TD method is no more than $\frac{1}{1-\gamma}$ times the *smallest possible error* (the one attained in the limit by the Monte Carlo method). For example, with $\gamma = 0.99$ this factor is $100$.
>
> $$
> \begin{align}
> \overline{VE}(\mathbf{w}_{TD}) & \leq \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w}) \tag{9.14}
> \end{align}
> $$

![[Pasted image 20240923173826.png|Pasted image 20240923173826.png]]

> [!NOTE] Equation 9.15
>
> $$
> \mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t < T, \tag{9.15}
> $$

> [!NOTE] Equation 9.16
>
> $$
> G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T - n. \tag{9.16}
> $$
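
A minimal sketch of Eqs. 9.15 and 9.16 for n-step semi-gradient TD, again assuming the linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\intercal \mathbf{x}(s)$ used above; the argument names are illustrative.

```python
import numpy as np

def n_step_return(rewards, x_bootstrap, w, gamma):
    """G_{t:t+n} per Eq. 9.16: n sampled rewards plus a bootstrapped tail.

    `rewards` holds R_{t+1}, ..., R_{t+n}; `x_bootstrap` is x(S_{t+n}),
    or None if the episode terminated before t+n (dropping the tail).
    """
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r
    if x_bootstrap is not None:
        G += (gamma ** len(rewards)) * (w @ x_bootstrap)
    return G

def n_step_update(w, x_t, G, alpha):
    """Eq. 9.15: w_{t+n} = w_{t+n-1} + alpha [G_{t:t+n} - v_hat(S_t, w)] grad v_hat."""
    return w + alpha * (G - w @ x_t) * x_t
```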

## 9.5 Feature Construction for Linear Methods

- 9.5.1 Polynomials
- 9.5.2 Fourier Basis (see the sketch after this list)
- 9.5.3 Coarse Coding
- 9.5.4 Tile Coding
- 9.5.5 Radial Basis Functions
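
As one example, the one-dimensional order-$n$ Fourier cosine basis (Section 9.5.2) uses features $x_i(s) = \cos(i\pi s)$ for $s \in [0, 1]$, $i = 0, \dots, n$. The sketch below first rescales the raw state; the `low`/`high` bounds and the example state range are assumptions for illustration.

```python
import numpy as np

def fourier_features(s, order, low=0.0, high=1.0):
    """One-dimensional order-n Fourier cosine basis: x_i(s) = cos(i * pi * s).

    The raw state is rescaled to [0, 1] using the assumed bounds `low`/`high`.
    Returns order + 1 features; i = 0 gives the constant (bias) feature.
    """
    s01 = (s - low) / (high - low)                     # rescale to [0, 1]
    return np.cos(np.pi * s01 * np.arange(order + 1))  # [cos(0), cos(pi s), ...]

# Example: 5th-order features for a state in a made-up range [-1.2, 0.6]
x = fourier_features(-0.5, order=5, low=-1.2, high=0.6)
```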

## 9.6 Selecting Step-Size Parameters Manually


> [!NOTE] Equation 9.19
> A good rule of thumb for setting the step-size parameter of *linear SGD methods* is:
>
> $$
> \begin{align}
> \alpha \doteq \left(\tau \mathbb{E}\left[\mathbf{x}^\intercal\mathbf{x}\right]\right)^{-1} \tag{9.19}
> \end{align}
> $$
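
A small sketch of Eq. 9.19, where $\tau$ is roughly the number of experiences within which learning should be mostly complete and $\mathbb{E}[\mathbf{x}^\intercal\mathbf{x}]$ is estimated from sampled feature vectors; the function and argument names are assumptions.

```python
import numpy as np

def rule_of_thumb_alpha(sample_features, tau):
    """Eq. 9.19: alpha = (tau * E[x^T x])^{-1}.

    `sample_features` is a (n_samples, d) array of feature vectors drawn
    roughly from the same distribution as the SGD inputs; `tau` is the number
    of experiences within which learning should be roughly complete.
    """
    mean_sq_norm = np.mean(np.sum(sample_features ** 2, axis=1))  # E[x . x]
    return 1.0 / (tau * mean_sq_norm)
```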

