Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[W2D5] T2: equation for greedy policy #446

Closed
willwayaka opened this issue Jul 23, 2020 · 3 comments
Closed

[W2D5] T2: equation for greedy policy #446

willwayaka opened this issue Jul 23, 2020 · 3 comments

Comments

@willwayaka
Copy link

willwayaka commented Jul 23, 2020

In the equation for greedy policy:

\begin{align}
P (a_{t} = a) =
\begin{cases}
1 - \epsilon & \quad \text{if } a_{t} = \text{argmax}{a} ; q{t} (a) \
\epsilon/N & \quad \text{else}
\end{cases}
\end{align}

(between sec2.1 and ex1) what does N stand for? There is no mention of it in the text.

@mwaskom mwaskom added the W2D5 label Jul 23, 2020
@rschween
Copy link

rschween commented Jul 24, 2020

As of slack discussion with @marcelomattar, N stands for the number of actions, but there are inconsistencies both within equation and in the text below regarding whether the argmax-option is among the options to choose in the non-greedy case. What the solution code does is it chooses the greedy option with probability (1-epsilon) and between all options (including the greedy), otherwise.

Suggested fix (in line with Sutton & Barto)

  • change first row of equation (probabiligy for argmax-option) to 1-epsilon + epsilon/N
  • define N below equation
  • adapt text below first bar graph accordingly

Alternatively (appears to also be used in practice; makes equation look slightly less complex):

  • change second row of equation (probability for non-argmax options to epsilon/(N-1)
  • define N below equation
  • adapt text below equation accordingly
  • change solution code to only select among non-argmax option in the "else:" condition.

@willwayaka
Copy link
Author

After going through the slack discussion, I propose the following fix:

$\epsilon$-greedy policy:
\begin{align}
P (a_t = a_k) = 
        \begin{cases}
        1 - \epsilon    & \quad \text{then } a_{k} = \arg\max_{a} q_{t} (a) \\
        \epsilon/N       & \quad \text{then } a_{k} \in \{a\}
        \end{cases} 
\end{align}
where $\epsilon \in [0,1]$, and $N$ is number of actions in $\{a\}$ available to the agent (including the optimal action, so that $1 - \epsilon + \epsilon/N \times N = 1$).

@JesseLivezey
Copy link
Contributor

@willwayaka IMO, the first case should be 1 - \epsilon + 1/N, then the second condition should be an "else". I think it confusing to have the actions contained in the "if" and the "else" be overlapping.

mwaskom pushed a commit that referenced this issue Aug 25, 2020
* Closes #387 and fixes RGB -> C1C2C0

* Closes #395

* Process tutorial notebooks

* Closes #483

* Process tutorial notebooks

* Closes #406 and fixes colors

* Process tutorial notebooks

* Closes #404

* Process tutorial notebooks

* Closes #419

* Process tutorial notebooks

* Closes #410 T3 comments

* Closes  #410

* Process tutorial notebooks

* Partially addresses #435 for T1

* Process tutorial notebooks

* Created using Colaboratory

* Process tutorial notebooks

* Closes #457 #446, partially addresses #455

* Closes #456

* Process tutorial notebooks

* Closes #455

* Created using Colaboratory

* Closes #460

* Process tutorial notebooks

Co-authored-by: GitHub Action <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants