[PUBLISHER] Merge #40
* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md

* PUSH ATTACHMENT : Pasted image 20240923171752.png

* PUSH ATTACHMENT : Pasted image 20240923172823.png

* PUSH ATTACHMENT : Pasted image 20240923173826.png
dgcnz authored Sep 23, 2024
1 parent 84598ba commit 4765516
Showing 4 changed files with 105 additions and 4 deletions.
New notation! ($s\to u$ is an update rule for $v(s)$ using new expression $u$)
>
> $$
> \begin{align}
> \overline{VE}(\mathbf{w}) &\doteq \sum_{s \in \mathcal{S}} \mu(s) \left[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\right]^2 && \tag{9.1}
> \end{align}
> $$
For on-policy episodic tasks, $\mu(s)$ is called the *on-policy distribution*, which can be computed as follows:
>
> $$
> \begin{align}
> \eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})p(s \mid \bar{s}, a), && \text{for all } s \in S && \tag{9.2}
> \end{align}
> $$
>
> $$
> \begin{align}
> \mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')} && \tag{9.3}
> \end{align}
> $$
Where:
- Minimizing $\overline{VE}$ with gradient methods only guarantees local optimality.
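
As a sanity check of Eqs. 9.2 and 9.3, the on-policy distribution of a small episodic MDP can be obtained by solving the linear system $\eta = h + P_\pi^\top \eta$ and normalizing. The sketch below is a minimal illustration; the three-state chain, `h`, and `P_pi` are made-up example values, not anything from the chapter.

```python
import numpy as np

# Hypothetical 3-state episodic chain: episodes always start in state 0.
# P_pi[s_bar, s] = sum_a pi(a | s_bar) p(s | s_bar, a); rows may sum to < 1,
# with the missing probability mass going to an (omitted) terminal state.
h = np.array([1.0, 0.0, 0.0])             # h(s): probability an episode starts in s
P_pi = np.array([[0.0, 0.9, 0.1],
                 [0.0, 0.5, 0.4],
                 [0.0, 0.0, 0.0]])         # state 2 always terminates the episode

# Eq. 9.2 in matrix form: eta = h + P_pi^T eta  =>  (I - P_pi^T) eta = h
eta = np.linalg.solve(np.eye(3) - P_pi.T, h)

# Eq. 9.3: normalize expected visit counts into the on-policy distribution
mu = eta / eta.sum()
print(eta, mu)
```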


## 9.3 Stochastic-gradient and Semi-gradient Methods

> [!NOTE] Equations 9.4 and 9.5
>
> $$
> \begin{align}
> \mathbf{w}_{t+1} &= \mathbf{w}_t - \frac{1}{2} \alpha \nabla \left[v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right]^2 && \tag{9.4} \\
> &= \mathbf{w}_t + \alpha \left[v_{\pi}(S_t) - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t) && \tag{9.5}
> \end{align}
> $$
However, since we don't know the true $v_\pi(s)$, we can replace it with the *target output* $U_t$:

> [!NOTE] Equation 9.7
>
> $$
> \begin{align}
> \mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha \left[U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t) && \tag{9.7}
> \end{align}
> $$
Where:
- $U_t$ *should* be an unbiased estimate of $v_\pi(s)$, that is:
- $\mathbb{E}[U_t \mid S_t=s] = v_\pi(s)$
- Under this condition (and a suitably decreasing step size $\alpha$), $\mathbf{w}_t$ is guaranteed to converge to a local optimum.

![[Pasted image 20240923171752.png|Pasted image 20240923171752.png]]

Examples of $U_t$:
- Monte Carlo target: $U_t = G_t$ (the return, i.e., the discounted sum of rewards until the end of the episode); this target is unbiased (see the sketch after this list).
- Bootstrapping targets (e.g., the TD(0) target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$) are biased because they depend on the current $\mathbf{w}$ through the value estimates of successor states.
- In practice the $\mathbf{w}$-dependent part of the target is treated as a constant (the gradient flow through it is stopped), so the update includes only part of the true gradient. This yields *semi-gradient methods*.
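
A minimal sketch of the Monte Carlo case of Eq. 9.7 ($U_t = G_t$), assuming a linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\intercal \mathbf{x}(s)$ (so $\nabla \hat{v} = \mathbf{x}(s)$) and a hypothetical episode format of `(feature_vector, reward)` pairs; both choices are illustrative, not from the note.

```python
import numpy as np

def gradient_mc_update(w, episode, alpha, gamma):
    """One gradient Monte Carlo pass over a finished episode (Eq. 9.7 with U_t = G_t).

    `episode` is an assumed list of (x_t, R_{t+1}) pairs, where x_t is the
    feature vector of S_t; with the linear v_hat = w . x, grad v_hat = x.
    """
    G = 0.0
    for x, r in reversed(episode):        # compute returns backwards through time
        G = r + gamma * G                 # G_t = R_{t+1} + gamma * G_{t+1}
        w = w + alpha * (G - w @ x) * x   # Eq. 9.7: w += alpha [U_t - v_hat] grad v_hat
    return w
```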

*Semi-gradient methods*:
- Do not converge as robustly as true gradient methods, although they do converge reliably in important cases such as the linear case.
- Are typically faster to learn and enable online/continual learning (see the TD(0) sketch below).

![[Pasted image 20240923172823.png|Pasted image 20240923172823.png]]
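
A minimal semi-gradient TD(0) sketch, again assuming a linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\intercal \mathbf{x}(s)$; the `env_step(s)` and `x(s)` interfaces are assumptions made for this sketch. The bootstrapped target is treated as a constant: only $\nabla \hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$ enters the update.

```python
import numpy as np

def semi_gradient_td0(w, env_step, x, alpha, gamma, s0, n_steps):
    """Semi-gradient TD(0) with linear v_hat(s, w) = w . x(s).

    `env_step(s)` is an assumed interface returning (r, s_next, done);
    `x(s)` maps a state to its feature vector.
    """
    s = s0
    for _ in range(n_steps):
        r, s_next, done = env_step(s)
        target = r if done else r + gamma * (w @ x(s_next))  # treated as constant
        w = w + alpha * (target - w @ x(s)) * x(s)            # semi-gradient update
        s = s0 if done else s_next                            # restart finished episodes
    return w
```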

## 9.4 Linear Methods

> [!NOTE] Equation 9.8
>
> $$
> \begin{align}
> \hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^\intercal \mathbf{x}(s) = \sum_{i=1}^d w_i x_i(s) && \tag{9.8}
> \end{align}
> $$
>
> Where:
> - $\mathbf{x}(s) = \left(x_1(s), \dots, x_d(s)\right)^\intercal$
- The chapter also analyzes semi-gradient TD(0) with linear approximation and shows that it converges to the *TD fixed point* $\mathbf{w}_{TD}$ (Eqs. 9.11, 9.12).
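
Eqs. 9.11 and 9.12 are not reproduced in this note; as a sketch, the fixed point satisfies $A\mathbf{w}_{TD} = b$ with $A = \mathbb{E}\left[\mathbf{x}_t(\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\intercal\right]$ and $b = \mathbb{E}\left[R_{t+1}\mathbf{x}_t\right]$, so it can be estimated directly from sampled transitions (an LSTD-style computation). The transition list format below is an assumption for illustration.

```python
import numpy as np

def td_fixed_point(transitions, gamma, d):
    """Estimate w_TD from sampled transitions (x_t, r_{t+1}, x_{t+1}).

    Uses sample averages A ~= E[x_t (x_t - gamma x_{t+1})^T] and
    b ~= E[R_{t+1} x_t], then solves A w = b.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    A /= len(transitions)
    b /= len(transitions)
    return np.linalg.solve(A, b)
```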


> [!NOTE] Equation 9.14
>
> Interpretation: the asymptotic error of the TD method is no more than $\frac{1}{1-\gamma}$ times the *smallest possible error* (the one attained in the limit by the Monte Carlo method). For example, with $\gamma = 0.99$ this factor is $100$.
>
> $$
> \begin{align}
> \overline{VE}(\mathbf{w}_{TD}) & \leq \frac{1}{1-\gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w}) \tag{9.14}
> \end{align}
> $$

![[Pasted image 20240923173826.png|Pasted image 20240923173826.png]]

> [!NOTE] Equation 9.15
>
> $$
> \mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha \left[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1}), \quad 0 \leq t < T, \tag{9.15}
> $$

> [!NOTE] Equation 9.16
>
> $$
> G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}_{t+n-1}), \quad 0 \leq t \leq T - n. \tag{9.16}
> $$
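
A minimal sketch of Eqs. 9.15 and 9.16 for n-step semi-gradient TD, again assuming the linear $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\intercal \mathbf{x}(s)$ used above; the argument names are illustrative.

```python
import numpy as np

def n_step_return(rewards, x_bootstrap, w, gamma):
    """G_{t:t+n} per Eq. 9.16: n sampled rewards plus a bootstrapped tail.

    `rewards` holds R_{t+1}, ..., R_{t+n}; `x_bootstrap` is x(S_{t+n}),
    or None if the episode terminated before t+n (dropping the tail).
    """
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r
    if x_bootstrap is not None:
        G += (gamma ** len(rewards)) * (w @ x_bootstrap)
    return G

def n_step_update(w, x_t, G, alpha):
    """Eq. 9.15: w_{t+n} = w_{t+n-1} + alpha [G_{t:t+n} - v_hat(S_t, w)] grad v_hat."""
    return w + alpha * (G - w @ x_t) * x_t
```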

## 9.5 Feature Construction for Linear Methods

- 9.5.1 Polynomials
- 9.5.2 Fourier Basis (see the sketch after this list)
- 9.5.3 Coarse Coding
- 9.5.4 Tile Coding
- 9.5.5 Radial Basis Functions
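
As one example, the one-dimensional order-$n$ Fourier cosine basis (Section 9.5.2) uses features $x_i(s) = \cos(i\pi s)$ for $s \in [0, 1]$, $i = 0, \dots, n$. The sketch below first rescales the raw state; the `low`/`high` bounds and the example state range are assumptions for illustration.

```python
import numpy as np

def fourier_features(s, order, low=0.0, high=1.0):
    """One-dimensional order-n Fourier cosine basis: x_i(s) = cos(i * pi * s).

    The raw state is rescaled to [0, 1] using the assumed bounds `low`/`high`.
    Returns order + 1 features; i = 0 gives the constant (bias) feature.
    """
    s01 = (s - low) / (high - low)                     # rescale to [0, 1]
    return np.cos(np.pi * s01 * np.arange(order + 1))  # [cos(0), cos(pi s), ...]

# Example: 5th-order features for a state in a made-up range [-1.2, 0.6]
x = fourier_features(-0.5, order=5, low=-1.2, high=0.6)
```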

## 9.6 Selecting Step-Size Parameters Manually


> [!NOTE] Equation 9.19
> A good rule of thumb for setting the step-size parameter of *linear SGD methods* is:
>
> $$
> \begin{align}
> \alpha \doteq \left(\tau \mathbb{E}\left[\mathbf{x}^\intercal\mathbf{x}\right]\right)^{-1} \tag{9.19}
> \end{align}
> $$
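
A small sketch of Eq. 9.19, where $\tau$ is roughly the number of experiences within which learning should be mostly complete and $\mathbb{E}[\mathbf{x}^\intercal\mathbf{x}]$ is estimated from sampled feature vectors; the function and argument names are assumptions.

```python
import numpy as np

def rule_of_thumb_alpha(sample_features, tau):
    """Eq. 9.19: alpha = (tau * E[x^T x])^{-1}.

    `sample_features` is a (n_samples, d) array of feature vectors drawn
    roughly from the same distribution as the SGD inputs; `tau` is the number
    of experiences within which learning should be roughly complete.
    """
    mean_sq_norm = np.mean(np.sum(sample_features ** 2, axis=1))  # E[x . x]
    return 1.0 / (tau * mean_sq_norm)
```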

