[W2D5] T2: equation for greedy policy #446

willwayaka · 2020-07-23T16:40:00Z

In the equation for greedy policy:

\begin{align}
P (a_{t} = a) =
\begin{cases}
1 - \epsilon & \quad \text{if } a_{t} = \text{argmax}{a} ; q{t} (a) \
\epsilon/N & \quad \text{else}
\end{cases}
\end{align}

(between sec2.1 and ex1) what does N stand for? There is no mention of it in the text.

rschween · 2020-07-24T05:31:03Z

As of slack discussion with @marcelomattar, N stands for the number of actions, but there are inconsistencies both within equation and in the text below regarding whether the argmax-option is among the options to choose in the non-greedy case. What the solution code does is it chooses the greedy option with probability (1-epsilon) and between all options (including the greedy), otherwise.

Suggested fix (in line with Sutton & Barto)

change first row of equation (probabiligy for argmax-option) to 1-epsilon + epsilon/N
define N below equation
adapt text below first bar graph accordingly

Alternatively (appears to also be used in practice; makes equation look slightly less complex):

change second row of equation (probability for non-argmax options to epsilon/(N-1)
define N below equation
adapt text below equation accordingly
change solution code to only select among non-argmax option in the "else:" condition.

willwayaka · 2020-07-24T17:00:34Z

After going through the slack discussion, I propose the following fix:

$\epsilon$-greedy policy:
\begin{align}
P (a_t = a_k) = 
        \begin{cases}
        1 - \epsilon    & \quad \text{then } a_{k} = \arg\max_{a} q_{t} (a) \\
        \epsilon/N       & \quad \text{then } a_{k} \in \{a\}
        \end{cases} 
\end{align}
where $\epsilon \in [0,1]$, and $N$ is number of actions in $\{a\}$ available to the agent (including the optimal action, so that $1 - \epsilon + \epsilon/N \times N = 1$).

JesseLivezey · 2020-07-24T22:15:14Z

@willwayaka IMO, the first case should be 1 - \epsilon + 1/N, then the second condition should be an "else". I think it confusing to have the actions contained in the "if" and the "else" be overlapping.

* Closes #387 and fixes RGB -> C1C2C0 * Closes #395 * Process tutorial notebooks * Closes #483 * Process tutorial notebooks * Closes #406 and fixes colors * Process tutorial notebooks * Closes #404 * Process tutorial notebooks * Closes #419 * Process tutorial notebooks * Closes #410 T3 comments * Closes #410 * Process tutorial notebooks * Partially addresses #435 for T1 * Process tutorial notebooks * Created using Colaboratory * Process tutorial notebooks * Closes #457 #446, partially addresses #455 * Closes #456 * Process tutorial notebooks * Closes #455 * Created using Colaboratory * Closes #460 * Process tutorial notebooks Co-authored-by: GitHub Action <[email protected]>

mwaskom added the W2D5 label Jul 23, 2020

JesseLivezey added a commit that referenced this issue Aug 24, 2020

Closes #457 #446, partially addresses #455

d43451c

JesseLivezey closed this as completed Aug 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[W2D5] T2: equation for greedy policy #446

[W2D5] T2: equation for greedy policy #446

willwayaka commented Jul 23, 2020 •

edited

Loading

rschween commented Jul 24, 2020 •

edited

Loading

willwayaka commented Jul 24, 2020

JesseLivezey commented Jul 24, 2020

[W2D5] T2: equation for greedy policy #446

[W2D5] T2: equation for greedy policy #446

Comments

willwayaka commented Jul 23, 2020 • edited Loading

rschween commented Jul 24, 2020 • edited Loading

willwayaka commented Jul 24, 2020

JesseLivezey commented Jul 24, 2020

willwayaka commented Jul 23, 2020 •

edited

Loading

rschween commented Jul 24, 2020 •

edited

Loading