[PUBLISHER] Merge #42
* PUSH NOTE : PyTorch Conference 2024 - Fast Sparse Vision Transformers with minimal accuracy loss.md

* PUSH ATTACHMENT : Pasted image 20240928133556.png

* PUSH ATTACHMENT : Pasted image 20240928133618.png

* PUSH ATTACHMENT : Pasted image 20240928133712.png

* PUSH ATTACHMENT : Pasted image 20241001102234.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 7.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 5.md

* PUSH ATTACHMENT : Pasted image 20240929183258.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 3.md

* PUSH ATTACHMENT : Pasted image 20240928215826.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 11.md

* PUSH NOTE : PyTorch internals.md

* PUSH ATTACHMENT : Pasted image 20240928131008.png

* PUSH NOTE : Reinforcement Learning - An Introduction.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 6.md
dgcnz authored Oct 1, 2024
1 parent c33cfa6 commit d87a6c9
Showing 16 changed files with 355 additions and 2 deletions.
@@ -0,0 +1,22 @@
---
authors:
- "[[Jesse Cai|Jesse Cai]]"
year: 2024
tags:
- presentation
url: https://static.sched.com/hosted_files/pytorch2024/c6/Sparsifying%20ViT%20lightning%20talk%20slides.pdf?_gl=1*19zah9b*_gcl_au*MTk3MjgxODE5OC4xNzI3MjU4NDM2*FPAU*MTk3MjgxODE5OC4xNzI3MjU4NDM2
share: true
---
Nice, it is on `torchao`

![[Pasted image 20240928133556.png|Pasted image 20240928133556.png]]

![[Pasted image 20240928133618.png|Pasted image 20240928133618.png]]

![[Pasted image 20240928133712.png|Pasted image 20240928133712.png]]


![[Pasted image 20241001102234.png|Pasted image 20241001102234.png]]
Notes:
- Don't quite understand what Core or AO refer to in this context, but at least `torch.compile` is acknowledged :p
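As a rough sketch of the 2:4 ("semi-structured") sparsity the talk is about, this is roughly what the core PyTorch API looks like; note it uses `torch.sparse.to_sparse_semi_structured` rather than the `torchao` wrappers from the slides, and the pruning heuristic is my own illustration, not the talk's recipe:

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Hypothetical sketch: prune a linear layer's weight to a 2:4 pattern
# (keep the 2 largest of every 4 contiguous values), then compress it so
# matmuls hit the sparse kernels. Needs a recent PyTorch + CUDA GPU.
linear = torch.nn.Linear(1024, 1024, bias=False).half().cuda()

w = linear.weight.detach()
groups = w.abs().reshape(-1, 4)
drop = groups.argsort(dim=-1)[:, :2]                        # 2 smallest per group of 4
mask = torch.ones_like(groups).scatter_(-1, drop, 0).reshape_as(w)
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(w * mask))

x = torch.rand(128, 1024, dtype=torch.float16, device="cuda")
y = linear(x)  # dispatched to the 2:4 sparse matmul kernels
```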

12 changes: 12 additions & 0 deletions docs/100 Reference notes/104 Other/PyTorch internals.md
@@ -0,0 +1,12 @@
---
authors:
- "[[Edward Z. Yang|Edward Z. Yang]]"
year: 2019
tags:
- blog
url: http://blog.ezyang.com/2019/05/pytorch-internals/
share: true
---

Depending on the tensor's metadata (whether it is CUDA, sparse, etc.), each operator call is dispatched to a different kernel implementation:
![[Pasted image 20240928131008.png|500]]
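A tiny sketch of that idea (my example, not ezyang's): the same Python-level call is routed to different kernels depending on the tensors' device and layout keys.

```python
import torch

a = torch.randn(4, 4)

_ = torch.mm(a, a)                    # dense CPU kernel
_ = torch.mm(a.to_sparse(), a)        # sparse COO kernel (SparseCPU key)
if torch.cuda.is_available():
    b = a.cuda()
    _ = torch.mm(b, b)                # dense CUDA kernel
```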
@@ -0,0 +1,34 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 11.5 Gradient Descent in the Bellman Error

> [!NOTE] Mean-squared temporal difference error
>
> $$
> \begin{align}
> \overline{TDE}(\mathbf{w}) &= \sum_{s \in \mathcal{S}} \mu(s) \mathbb{E}\left[\delta_t^2 \mid S_t = s, A_t \sim \pi \right] \\
> &= \sum_{s \in \mathcal{S}} \mu(s) \mathbb{E}\left[\rho_t \delta_t^2 \mid S_t = s, A_t \sim b \right] \\
> &= \mathbb{E}_b\left[\rho_t \delta_t^2 \right]
> \end{align}
> $$
> [!NOTE] Equation 11.23: Weight update of the naive residual-gradient algorithm
>
> $$
> \begin{align}
> \mathbf{w}_{t+1} &= \mathbf{w}_t - \frac{1}{2} \alpha \nabla(\rho_t \delta_t^2) \\
> &= \mathbf{w}_t - \alpha \rho_t \delta_t \nabla(\delta_t) \\
> &= \mathbf{w}_t + \alpha \rho_t \delta_t (\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)) \tag{11.23} \\
> \end{align}
> $$
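A minimal sketch of one update of (11.23) for the linear case $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, where the gradient of $\hat{v}$ is just the feature vector (the function and argument names here are mine, not the book's):

```python
import numpy as np

def naive_residual_gradient_step(w, x_t, x_tp1, r_tp1, rho_t, alpha, gamma):
    """One step of the naive residual-gradient update (Eq. 11.23), linear case."""
    delta_t = r_tp1 + gamma * w @ x_tp1 - w @ x_t   # TD error (Eq. 6.5)
    return w + alpha * rho_t * delta_t * (x_t - gamma * x_tp1)
```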


@@ -0,0 +1,145 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 3.1 The Agent-Environment Interface


> [!NOTE] Equation 3.1: Trajectory
>
> $$
> S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3, \dots \tag{3.1}
> $$

> [!NOTE] Equation 3.2: MDP dynamics
>
> $$
> p(s', r \mid s, a) \doteq \Pr \{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \} \tag{3.2}
> $$

From the four-argument dynamics $p(s', r \mid s, a)$ you can obtain the *state-transition probabilities* by marginalizing over rewards, and the *expected reward* with the law of total expectation (see the equations below).
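Concretely, those marginalizations are the book's Equations 3.4 and 3.5:

$$
p(s' \mid s, a) \doteq \sum_{r \in \mathcal{R}} p(s', r \mid s, a), \qquad
r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
$$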

## 3.2 Goals and Rewards

> [!FAQ]- What is the reward hypothesis?
>
> The reward hypothesis is the idea that **all of what we mean by goals** and purposes can be well thought of as the **maximization** of the expected value of the cumulative sum of a received scalar signal (called **reward**).

- The reward signal is your way of communicating to the agent what you want it to achieve, **not how you want it to achieve it**.


## 3.3 Returns and Episodes

> [!NOTE] Equation 3.7: Undiscounted return
>
> $$
> G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T \tag{3.7}
> $$
> [!NOTE] Equation 3.8: Discounted return
>
> $$
> G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{3.8}
> $$
>
> Where $\gamma$ is the discount rate.

> [!NOTE] Equation 3.9: Recursive definition of return
>
> You can group Eq 3.8 into a recursive definition of the return.
>
> $$
> G_t \doteq R_{t+1} + \gamma G_{t+1} \tag{3.9}
> $$

## 3.4 Unified Notation for Episodic and Continuing Tasks

![[Pasted image 20240928215826.png|Pasted image 20240928215826.png]]

## 3.5 Policies and Value Functions

A policy $\pi(a \mid s)$ is a probability distribution over actions given states.

> [!NOTE] Equation 3.12: State-value function
>
> $$
> v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \;\; \forall s \in \mathcal{S} \tag{3.12}
> $$
> [!NOTE] Equation 3.13: Action-value function
>
> $$
> q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \;\; \forall s \in \mathcal{S}, a \in \mathcal{A} \tag{3.13}
> $$
> [!NOTE] Equation 3.14: Bellman equation for $v_{\pi}$
>
> $$
> \begin{align}
> v_\pi(s) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
> &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \tag{by (3.9)} \\
> &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma \mathbb{E}_{\pi}\left[G_{t+1} \mid S_{t+1} = s'\right]\right] \\
> &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_\pi(s')] \tag{3.14}
> \end{align}
> $$
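Equation (3.14) is exactly what iterative policy evaluation sweeps over. A minimal in-place sketch, assuming a tabular model `p[s][a] = [(prob, s', r), ...]` and a policy table `pi[s][a] = probability` (both encodings are mine, not the book's pseudocode):

```python
def policy_evaluation(states, p, pi, gamma=0.9, theta=1e-8):
    """Iterate V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * sum(prob * (r + gamma * V[s2]) for prob, s2, r in outcomes)
                for a, outcomes in p[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```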
## 3.6 Optimal Policies and Optimal Value Functions

> [!NOTE] Equation 3.15: Optimal state-value function
>
> $$
> v_*(s) \doteq \max_{\pi} v_{\pi}(s) \tag{3.15}
> $$
> [!NOTE] Equation 3.16: Optimal action-value function
>
> $$
> q_*(s, a) \doteq \max_{\pi} q_{\pi}(s, a) \tag{3.16}
> $$
> [!NOTE] Equation 3.17
>
> $$
> q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \tag{3.17}
> $$
> [!NOTE] Equation 3.18 and 3.19: Bellman optimality equations for $v_*$
>
> $$
> \begin{align}
> v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\
> &= \max_{a} \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
> &= \max_{a} \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \tag{by (3.9)} \\
> &= \max_{a} \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \tag{3.18} \\
> &= \max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_*(s')] \tag{3.19} \\
> \end{align}
> $$
> [!NOTE] Equation 3.20: Bellman optimality equation for $q_*$
>
> $$
> \begin{align}
> q_*(s, a) &= \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a] \\
> &= \sum_{s', r} p(s', r \mid s, a) [r + \gamma \max_{a'} q_*(s', a')] \tag{3.20}
> \end{align}
> $$

**Any policy that is greedy with respect to the optimal evaluation function $v_*$ is an optimal policy.**
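Equation (3.19) plus that greedy statement is value iteration in a nutshell: back up with the max, then read off the greedy policy. A rough sketch using the same `p[s][a] = [(prob, s', r), ...]` encoding as the policy-evaluation sketch earlier (again, my own names):

```python
def value_iteration(states, p, gamma=0.9, theta=1e-8):
    """V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]   (Eq. 3.19)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:            # assumes every state has at least one action
            v_new = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in outcomes)
                for outcomes in p[s].values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # any policy greedy w.r.t. v* is optimal
    greedy = {
        s: max(p[s], key=lambda a: sum(prob * (r + gamma * V[s2])
                                       for prob, s2, r in p[s][a]))
        for s in states
    }
    return V, greedy
```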




@@ -0,0 +1,73 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 5.1 Monte Carlo prediction

- First-visit MC: independence assumptions (the averaged returns are i.i.d.), which makes it easier to analyze theoretically.
- Every-visit MC: averages the returns following every visit to the state.

- [ ] TODO: finish notes
## 5.4 Monte Carlo Control without Exploring Starts

- $\epsilon$-greedy policy (action-selection sketch below)
	- All non-greedy actions have a minimum probability of $\frac{\epsilon}{|\mathcal{A}|}$
	- The greedy action has probability $(1 - \epsilon) + \frac{\epsilon}{|\mathcal{A}|}$
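A small sketch of that action distribution for a tabular `Q[s]` mapping actions to values (names are mine):

```python
import random

def epsilon_greedy(Q_s, epsilon):
    """Sample an action from an {action: value} dict with epsilon-greedy probabilities."""
    actions = list(Q_s)
    if random.random() < epsilon:
        return random.choice(actions)      # every action gets at least eps/|A|
    return max(actions, key=Q_s.get)       # greedy action: (1 - eps) + eps/|A|
```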

- [ ] TODO: finish notes

## 5.5 Off-policy Prediction via Importance Sampling

Given a starting state $S_t$, the probability of the subsequent state-action trajectory, $A_t, S_{t+1}, A_{t+1}, \dots, S_T$, under the policy $\pi$ is given by:

$$
\begin{align}
Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} & = \prod_{k=t}^{T-1} \pi(A_k \mid S_k) p(S_{k+1} \mid S_k, A_k)
\end{align}
$$


> [!NOTE] Equation 5.3: Importance sampling ratio
>
> $$
> \rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k) p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k) p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \tag{5.3}
> $$
> [!NOTE] Equation 5.4: Value function for target policy $\pi$ under behavior policy $b$
>
> The importance sampling ratio corrects the expectation under $b$ so that it recovers $v_\pi$:
>
> $$
> \begin{align}
> v_\pi(s) &\doteq \mathbb{E}_b[\rho_{t:T - 1}G_t \mid S_t = s] \tag{5.4} \\
> \end{align}
> $$
> [!NOTE] Equation 5.5: Ordinary importance sampling
>
> $$
> V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1} G_t}{|\mathcal{T}(s)|} \tag{5.5}
> $$
> [!NOTE] Equation 5.6: Weighted importance sampling
>
> $$
> V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1}} \tag{5.6}
> $$
![[Pasted image 20240929183258.png|Pasted image 20240929183258.png]]

In practice, weighted importance sampling has much lower error at the beginning.
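A sketch of (5.5) vs. (5.6) for a single fixed state, given one $(\rho_{t:T-1}, G_t)$ pair per first visit (my own helper, not the book's incremental algorithm from 5.6):

```python
def importance_sampling_estimates(samples):
    """samples: list of (rho, G) pairs collected at first visits to one state s."""
    weighted_sum = sum(rho * G for rho, G in samples)
    ordinary = weighted_sum / len(samples)                      # Eq. 5.5: unbiased, can have huge variance
    total_rho = sum(rho for rho, _ in samples)
    weighted = weighted_sum / total_rho if total_rho else 0.0   # Eq. 5.6: biased, much lower variance
    return ordinary, weighted
```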


## 5.6 Incremental Implementation

#todo

@@ -8,6 +8,45 @@ tags:
url:
share: true
---
## 6.1 TD Prediction

> [!NOTE] Equation 6.2: TD(0) update
>
> $$
> \begin{align}
> V(S_t) &\leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \tag{6.2} \\
> \end{align}
> $$
> [!NOTE] Equations 6.3 and 6.4: Relationship between TD(0), MC and DP
>
> $$
> \begin{align}
> v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \tag{6.3} \\
> &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \tag{from (3.9)} \\
> &= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \tag{6.4} \\
> \end{align}
> $$
> [!faq]- Why is (6.3) called the Monte Carlo *estimate*?
> Because the expected value is not known, and sampled returns are used in its place.
> [!faq]- Why is (6.4) called the Dynamic Programming *estimate*?
> Although the expectation could be computed exactly (the dynamics are known), $v_\pi(S_{t+1})$ is not known, so the current estimate $V(S_{t+1})$ is used in its place.
> [!faq]- By looking at the previous two answers, what does TD(0) estimate and how does that differ from the previous methods?
> TD(0) both maintains an estimate of the value function (bootstrapping, like DP) and uses a sampled reward and next state in place of the expectation (sampling, like MC).


> [!NOTE] Equation 6.5: TD error
>
> $$
> \begin{align}
> \delta_t &\doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \tag{6.5}
> \end{align}
> $$
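Putting (6.2) and (6.5) together, a minimal tabular TD(0) prediction loop; the `env.reset() -> s` / `env.step(a) -> (s', r, done)` interface and `policy(s)` are my own assumptions, not the book's pseudocode:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                     # V(s) = 0 for unseen states
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])    # Eq. 6.2, with TD error delta_t (Eq. 6.5)
            s = s_next
    return dict(V)
```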
## 6.4 Sarsa: On-policy TD Control

> [!NOTE] Equation 6.7
@@ -0,0 +1,26 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 7.1 $n$-step TD prediction

One-step return:

$$
G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1})
$$

> [!NOTE] Equation 7.1: $n$-step return
>
> $$
> G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t + n - 1}(S_{t+n}) \tag{7.1}
> $$
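A direct transcription of (7.1), assuming `rewards[k]` stores $R_{k+1}$ and `states[k]` stores $S_k$ (the indexing convention is mine):

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """G_{t:t+n} = R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    return G + gamma ** n * V[states[t + n]]
```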


@@ -167,7 +167,7 @@ Examples of $U_t$:


> [!NOTE] Equation 9.19
> A good rule of thumb for setting the step-size parameter of *linear SGD methods* is:
> Suppose you wanted to learn in about $\tau$ experiences with substantially the same feature vector. A good rule of thumb for setting the step-size parameter of *linear SGD methods* is:
>
> $$
> \begin{align}
@@ -8,7 +8,9 @@ tags:
url:
share: true
---

- [[Reinforcement Learning - An Introduction - Chapter 3|Reinforcement Learning - An Introduction - Chapter 3]]
- [[Reinforcement Learning - An Introduction - Chapter 4|Reinforcement Learning - An Introduction - Chapter 4]]
- [[Reinforcement Learning - An Introduction - Chapter 6|Reinforcement Learning - An Introduction - Chapter 6]]
- [[Reinforcement Learning - An Introduction - Chapter 9|Reinforcement Learning - An Introduction - Chapter 9]]


Binary file added docs/images/Pasted image 20240928131008.png
Binary file added docs/images/Pasted image 20240928133556.png
Binary file added docs/images/Pasted image 20240928133618.png
Binary file added docs/images/Pasted image 20240928133712.png
Binary file added docs/images/Pasted image 20240928215826.png
Binary file added docs/images/Pasted image 20240929183258.png
Binary file added docs/images/Pasted image 20241001102234.png
