[PUBLISHER] Merge #42
* PUSH NOTE : PyTorch Conference 2024 - Fast Sparse Vision Transformers with minimal accuracy loss.md

* PUSH ATTACHMENT : Pasted image 20240928133556.png

* PUSH ATTACHMENT : Pasted image 20240928133618.png

* PUSH ATTACHMENT : Pasted image 20240928133712.png

* PUSH ATTACHMENT : Pasted image 20241001102234.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 7.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 5.md

* PUSH ATTACHMENT : Pasted image 20240929183258.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 3.md

* PUSH ATTACHMENT : Pasted image 20240928215826.png

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 11.md

* PUSH NOTE : PyTorch internals.md

* PUSH ATTACHMENT : Pasted image 20240928131008.png

* PUSH NOTE : Reinforcement Learning - An Introduction.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 9.md

* PUSH NOTE : Reinforcement Learning - An Introduction - Chapter 6.md
dgcnz authored Oct 1, 2024
1 parent c33cfa6 commit d87a6c9
Showing 16 changed files with 355 additions and 2 deletions.
@@ -0,0 +1,22 @@
---
authors:
- "[[Jesse Cai|Jesse Cai]]"
year: 2024
tags:
- presentation
url: https://static.sched.com/hosted_files/pytorch2024/c6/Sparsifying%20ViT%20lightning%20talk%20slides.pdf?_gl=1*19zah9b*_gcl_au*MTk3MjgxODE5OC4xNzI3MjU4NDM2*FPAU*MTk3MjgxODE5OC4xNzI3MjU4NDM2
share: true
---
Nice, it is on `torchao`

![[Pasted image 20240928133556.png|Pasted image 20240928133556.png]]

![[Pasted image 20240928133618.png|Pasted image 20240928133618.png]]

![[Pasted image 20240928133712.png|Pasted image 20240928133712.png]]


![[Pasted image 20241001102234.png|Pasted image 20241001102234.png]]
Notes:
- Don't quite understand what Core or AO refer to in this context, but at least `torch.compile` is acknowledged :p
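As a rough sketch of the 2:4 ("semi-structured") sparsity the talk is about, this is roughly what the core PyTorch API looks like; note it uses `torch.sparse.to_sparse_semi_structured` rather than the `torchao` wrappers from the slides, and the pruning heuristic is my own illustration, not the talk's recipe:

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Hypothetical sketch: prune a linear layer's weight to a 2:4 pattern
# (keep the 2 largest of every 4 contiguous values), then compress it so
# matmuls hit the sparse kernels. Needs a recent PyTorch + CUDA GPU.
linear = torch.nn.Linear(1024, 1024, bias=False).half().cuda()

w = linear.weight.detach()
groups = w.abs().reshape(-1, 4)
drop = groups.argsort(dim=-1)[:, :2]                        # 2 smallest per group of 4
mask = torch.ones_like(groups).scatter_(-1, drop, 0).reshape_as(w)
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(w * mask))

x = torch.rand(128, 1024, dtype=torch.float16, device="cuda")
y = linear(x)  # dispatched to the 2:4 sparse matmul kernels
```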

12 changes: 12 additions & 0 deletions docs/100 Reference notes/104 Other/PyTorch internals.md
@@ -0,0 +1,12 @@
---
authors:
- "[[Edward Z. Yang|Edward Z. Yang]]"
year: 2019
tags:
- blog
url: http://blog.ezyang.com/2019/05/pytorch-internals/
share: true
---

Depending on the tensor's metadata (whether it is CUDA, sparse, etc.), each operator call is dispatched to a different kernel implementation:
![[Pasted image 20240928131008.png|500]]
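A tiny sketch of that idea (my example, not ezyang's): the same Python-level call is routed to different kernels depending on the tensors' device and layout keys.

```python
import torch

a = torch.randn(4, 4)

_ = torch.mm(a, a)                    # dense CPU kernel
_ = torch.mm(a.to_sparse(), a)        # sparse COO kernel (SparseCPU key)
if torch.cuda.is_available():
    b = a.cuda()
    _ = torch.mm(b, b)                # dense CUDA kernel
```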
@@ -0,0 +1,34 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 11.5 Gradient Descent in the Bellman Error

> [!NOTE] Mean-squared temporal difference error
>
> $$
> \begin{align}
> \overline{TDE}(\mathbf{w}) &= \sum_{s \in \mathcal{S}} \mu(s) \mathbb{E}\left[\delta_t^2 \mid S_t = s, A_t \sim \pi \right] \\
> &= \sum_{s \in \mathcal{S}} \mu(s) \mathbb{E}\left[\rho_t \delta_t^2 \mid S_t = s, A_t \sim b \right] \\
> &= \mathbb{E}_b\left[\rho_t \delta_t^2 \right]
> \end{align}
> $$
> [!NOTE] Equation 11.23: Weight update of the naive residual-gradient algorithm
>
> $$
> \begin{align}
> \mathbf{w}_{t+1} &= \mathbf{w}_t - \frac{1}{2} \alpha \nabla(\rho_t \delta_t^2) \\
> &= \mathbf{w}_t - \alpha \rho_t \delta_t \nabla(\delta_t) \\
> &= \mathbf{w}_t + \alpha \rho_t \delta_t (\nabla \hat{v}(S_t, \mathbf{w}_t) - \gamma \nabla \hat{v}(S_{t+1}, \mathbf{w}_t)) \tag{11.23} \\
> \end{align}
> $$
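A minimal sketch of one update of (11.23) for the linear case $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$, where the gradient of $\hat{v}$ is just the feature vector (the function and argument names here are mine, not the book's):

```python
import numpy as np

def naive_residual_gradient_step(w, x_t, x_tp1, r_tp1, rho_t, alpha, gamma):
    """One step of the naive residual-gradient update (Eq. 11.23), linear case."""
    delta_t = r_tp1 + gamma * w @ x_tp1 - w @ x_t   # TD error (Eq. 6.5)
    return w + alpha * rho_t * delta_t * (x_t - gamma * x_tp1)
```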


@@ -0,0 +1,145 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 3.1 The Agent-Environment Interface


> [!NOTE] Equation 3.1: Trajectory
>
> $$
> S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3, \dots \tag{3.1}
> $$

> [!NOTE] Equation 3.2: MDP dynamics
>
> $$
> p(s', r \mid s, a) \doteq \Pr \{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \} \tag{3.2}
> $$

From the four-argument dynamics $p(s', r \mid s, a)$ you can obtain the *state-transition probabilities* by marginalizing over rewards, and the *expected reward* with the law of total expectation (see the equations below).
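Concretely, those marginalizations are the book's Equations 3.4 and 3.5:

$$
p(s' \mid s, a) \doteq \sum_{r \in \mathcal{R}} p(s', r \mid s, a), \qquad
r(s, a) \doteq \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
$$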

## 3.2 Goals and Rewards

> [!FAQ]- What is the reward hypothesis?
>
> The reward hypothesis is the idea that **all of what we mean by goals** and purposes can be well thought of as the **maximization** of the expected value of the cumulative sum of a received scalar signal (called **reward**).

- The reward signal is your way of communicating to the agent what you want it to achieve, **not how you want it to achieve it**.


## 3.3 Returns and Episodes

> [!NOTE] Equation 3.7: Undiscounted return
>
> $$
> G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T \tag{3.7}
> $$
> [!NOTE] Equation 3.8: Discounted return
>
> $$
> G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{3.8}
> $$
>
> Where $\gamma$ is the discount rate.

> [!NOTE] Equation 3.9: Recursive definition of return
>
> You can group Eq 3.8 into a recursive definition of the return.
>
> $$
> G_t \doteq R_{t+1} + \gamma G_{t+1} \tag{3.9}
> $$

## 3.4 Unified Notation for Episodic and Continuing Tasks

![[Pasted image 20240928215826.png|Pasted image 20240928215826.png]]

## 3.5 Policies and Value Functions

A policy $\pi(a \mid s)$ is a probability distribution over actions given states.

> [!NOTE] Equation 3.12: State-value function
>
> $$
> v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \;\; \forall s \in \mathcal{S} \tag{3.12}
> $$
> [!NOTE] Equation 3.13: Action-value function
>
> $$
> q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \;\; \forall s \in \mathcal{S}, a \in \mathcal{A} \tag{3.13}
> $$
> [!NOTE] Equation 3.14: Bellman equation for $v_{\pi}$
>
> $$
> \begin{align}
> v_\pi(s) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] \\
> &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \tag{by (3.9)} \\
> &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma \mathbb{E}_{\pi}\left[G_{t+1} \mid S_{t+1} = s'\right]\right] \\
> &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_\pi(s')] \tag{3.14}
> \end{align}
> $$
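Equation (3.14) is exactly what iterative policy evaluation sweeps over. A minimal in-place sketch, assuming a tabular model `p[s][a] = [(prob, s', r), ...]` and a policy table `pi[s][a] = probability` (both encodings are mine, not the book's pseudocode):

```python
def policy_evaluation(states, p, pi, gamma=0.9, theta=1e-8):
    """Iterate V(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * sum(prob * (r + gamma * V[s2]) for prob, s2, r in outcomes)
                for a, outcomes in p[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```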
## 3.6 Optimal Policies and Optimal Value Functions

> [!NOTE] Equation 3.15: Optimal state-value function
>
> $$
> v_*(s) \doteq \max_{\pi} v_{\pi}(s) \tag{3.15}
> $$
> [!NOTE] Equation 3.16: Optimal action-value function
>
> $$
> q_*(s, a) \doteq \max_{\pi} q_{\pi}(s, a) \tag{3.16}
> $$
> [!NOTE] Equation 3.17
>
> $$
> q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \tag{3.17}
> $$
> [!NOTE] Equation 3.18 and 3.19: Bellman optimality equations for $v_*$
>
> $$
> \begin{align}
> v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\
> &= \max_{a} \mathbb{E}_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
> &= \max_{a} \mathbb{E}_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \tag{by (3.9)} \\
> &= \max_{a} \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \tag{3.18} \\
> &= \max_{a} \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_*(s')] \tag{3.19} \\
> \end{align}
> $$
> [!NOTE] Equation 3.20: Bellman optimality equation for $q_*$
>
> $$
> \begin{align}
> q_*(s, a) &= \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a] \\
> &= \sum_{s', r} p(s', r \mid s, a) [r + \gamma \max_{a'} q_*(s', a')] \tag{3.20}
> \end{align}
> $$

**Any policy that is greedy with respect to the optimal evaluation function $v_*$ is an optimal policy.**
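Equation (3.19) plus that greedy statement is value iteration in a nutshell: back up with the max, then read off the greedy policy. A rough sketch using the same `p[s][a] = [(prob, s', r), ...]` encoding as the policy-evaluation sketch earlier (again, my own names):

```python
def value_iteration(states, p, gamma=0.9, theta=1e-8):
    """V(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma V(s')]   (Eq. 3.19)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:            # assumes every state has at least one action
            v_new = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in outcomes)
                for outcomes in p[s].values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # any policy greedy w.r.t. v* is optimal
    greedy = {
        s: max(p[s], key=lambda a: sum(prob * (r + gamma * V[s2])
                                       for prob, s2, r in p[s][a]))
        for s in states
    }
    return V, greedy
```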




@@ -0,0 +1,73 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 5.1 Monte Carlo prediction

- First-visit MC: independence assumptions (the averaged returns are i.i.d.), which makes it easier to analyze theoretically.
- Every-visit MC: averages the returns following every visit to the state.

- [ ] TODO: finish notes
## 5.4 Monte Carlo Control without Exploring Starts

- $\epsilon$-greedy policy (action-selection sketch below)
	- All non-greedy actions have a minimum probability of $\frac{\epsilon}{|\mathcal{A}|}$
	- The greedy action has probability $(1 - \epsilon) + \frac{\epsilon}{|\mathcal{A}|}$
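A small sketch of that action distribution for a tabular `Q[s]` mapping actions to values (names are mine):

```python
import random

def epsilon_greedy(Q_s, epsilon):
    """Sample an action from an {action: value} dict with epsilon-greedy probabilities."""
    actions = list(Q_s)
    if random.random() < epsilon:
        return random.choice(actions)      # every action gets at least eps/|A|
    return max(actions, key=Q_s.get)       # greedy action: (1 - eps) + eps/|A|
```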

- [ ] TODO: finish notes

## 5.5 Off-policy Prediction via Importance Sampling

Given a starting state $S_t$, the probability of the subsequent state-action trajectory, $A_t, S_{t+1}, A_{t+1}, \dots, S_T$, under the policy $\pi$ is given by:

$$
\begin{align}
Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} & = \prod_{k=t}^{T-1} \pi(A_k \mid S_k) p(S_{k+1} \mid S_k, A_k)
\end{align}
$$


> [!NOTE] Equation 5.3: Importance sampling ratio
>
> $$
> \rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k) p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k) p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \tag{5.3}
> $$
> [!NOTE] Equation 5.4: Value function for target policy $\pi$ under behavior policy $b$
>
> The importance sampling ratio corrects the expectation under $b$ so that it recovers $v_\pi$:
>
> $$
> \begin{align}
> v_\pi(s) &\doteq \mathbb{E}_b[\rho_{t:T - 1}G_t \mid S_t = s] \tag{5.4} \\
> \end{align}
> $$
> [!NOTE] Equation 5.5: Ordinary importance sampling
>
> $$
> V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1} G_t}{|\mathcal{T}(s)|} \tag{5.5}
> $$
> [!NOTE] Equation 5.6: Weighted importance sampling
>
> $$
> V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1}} \tag{5.6}
> $$
![[Pasted image 20240929183258.png|Pasted image 20240929183258.png]]

In practice, weighted importance sampling has much lower error at the beginning.
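A sketch of (5.5) vs. (5.6) for a single fixed state, given one $(\rho_{t:T-1}, G_t)$ pair per first visit (my own helper, not the book's incremental algorithm from 5.6):

```python
def importance_sampling_estimates(samples):
    """samples: list of (rho, G) pairs collected at first visits to one state s."""
    weighted_sum = sum(rho * G for rho, G in samples)
    ordinary = weighted_sum / len(samples)                      # Eq. 5.5: unbiased, can have huge variance
    total_rho = sum(rho for rho, _ in samples)
    weighted = weighted_sum / total_rho if total_rho else 0.0   # Eq. 5.6: biased, much lower variance
    return ordinary, weighted
```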


## 5.6 Incremental Implementation

#todo

@@ -8,6 +8,45 @@ tags:
url:
share: true
---
## 6.1 TD Prediction

> [!NOTE] Equation 6.2: TD(0) update
>
> $$
> \begin{align}
> V(S_t) &\leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \tag{6.2} \\
> \end{align}
> $$
> [!NOTE] Equations 6.3 and 6.4: Relationship between TD(0), MC and DP
>
> $$
> \begin{align}
> v_\pi(s) &\doteq \mathbb{E}_\pi[G_t \mid S_t = s] \tag{6.3} \\
> &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \tag{from (3.9)} \\
> &= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \tag{6.4} \\
> \end{align}
> $$
> [!faq]- Why is (6.3) called the Monte Carlo *estimate*?
> Because the expected value is not known, and sampled returns are used in its place.
> [!faq]- Why is (6.4) called the Dynamic Programming *estimate*?
> Although the expectation could be computed exactly (the dynamics are known), $v_\pi(S_{t+1})$ is not known, so the current estimate $V(S_{t+1})$ is used in its place.
> [!faq]- By looking at the previous two answers, what does TD(0) estimate and how does that differ from the previous methods?
> TD(0) both maintains an estimate of the value function (bootstrapping, like DP) and uses a sampled reward and next state in place of the expectation (sampling, like MC).


> [!NOTE] Equation 6.5: TD error
>
> $$
> \begin{align}
> \delta_t &\doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \tag{6.5}
> \end{align}
> $$
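Putting (6.2) and (6.5) together, a minimal tabular TD(0) prediction loop; the `env.reset() -> s` / `env.step(a) -> (s', r, done)` interface and `policy(s)` are my own assumptions, not the book's pseudocode:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.9):
    V = defaultdict(float)                     # V(s) = 0 for unseen states
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])    # Eq. 6.2, with TD error delta_t (Eq. 6.5)
            s = s_next
    return dict(V)
```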
## 6.4 Sarsa: On-policy TD Control

> [!NOTE] Equation 6.7
@@ -0,0 +1,26 @@
---
authors:
- "[[Richard S. Sutton|Richard S. Sutton]]"
- "[[Andrew G. Barton|Andrew G. Barton]]"
year: 2018
tags:
- textbook
url:
share: true
---
## 7.1 $n$-step TD prediction

One-step return:

$$
G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1})
$$

> [!NOTE] Equation 7.1: $n$-step return
>
> $$
> G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t + n - 1}(S_{t+n}) \tag{7.1}
> $$
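A direct transcription of (7.1), assuming `rewards[k]` stores $R_{k+1}$ and `states[k]` stores $S_k$ (the indexing convention is mine):

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """G_{t:t+n} = R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})."""
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    return G + gamma ** n * V[states[t + n]]
```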


@@ -167,7 +167,7 @@ Examples of $U_t$:


> [!NOTE] Equation 9.19
> A good rule of thumb for setting the step-size parameter of *linear SGD methods* is:
> Suppose you wanted to learn in about $\tau$ experiences with substantially the same feature vector. A good rule of thumb for setting the step-size parameter of *linear SGD methods* is:
>
> $$
> \begin{align}
@@ -8,7 +8,9 @@ tags:
url:
share: true
---

- [[Reinforcement Learning - An Introduction - Chapter 3|Reinforcement Learning - An Introduction - Chapter 3]]
- [[Reinforcement Learning - An Introduction - Chapter 4|Reinforcement Learning - An Introduction - Chapter 4]]
- [[Reinforcement Learning - An Introduction - Chapter 6|Reinforcement Learning - An Introduction - Chapter 6]]
- [[Reinforcement Learning - An Introduction - Chapter 9|Reinforcement Learning - An Introduction - Chapter 9]]


Binary file added docs/images/Pasted image 20240928131008.png
Binary file added docs/images/Pasted image 20240928133556.png
Binary file added docs/images/Pasted image 20240928133618.png
Binary file added docs/images/Pasted image 20240928133712.png
Binary file added docs/images/Pasted image 20240928215826.png
Binary file added docs/images/Pasted image 20240929183258.png
Binary file added docs/images/Pasted image 20241001102234.png
