The following code takes the action logit for the optimal (max) action and diffs it against the log prob of the action from the previous actor model iteration. Should we instead pick the action from `old_actions` rather than just taking the `max`, so that we compare the probability of the same action across the two iterations?
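To make the question concrete, here is a minimal sketch of the two variants. The code the issue refers to is not reproduced here, so everything below is an assumption: the function names, the use of log-softmax over raw logits, and the example numbers are all hypothetical, and only `old_actions` and `max` come from the question itself.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def ratio_for_taken_action(new_logits, old_log_prob, old_action):
    """pi_new(a|s) / pi_old(a|s) for the action a that was actually sampled."""
    new_log_prob = log_softmax(new_logits)[old_action]
    return math.exp(new_log_prob - old_log_prob)


def ratio_for_max_action(new_logits, old_log_prob):
    """Variant described in the question: uses the highest-scoring action
    under the new policy, which may differ from the sampled action."""
    new_log_prob = max(log_softmax(new_logits))
    return math.exp(new_log_prob - old_log_prob)


# Hypothetical example: three actions, and the old policy sampled action 2.
old_logits = [1.0, 0.5, 0.2]
new_logits = [1.2, 0.4, 0.3]
old_action = 2
old_log_prob = log_softmax(old_logits)[old_action]

same_action_ratio = ratio_for_taken_action(new_logits, old_log_prob, old_action)
max_action_ratio = ratio_for_max_action(new_logits, old_log_prob)
```

In this sketch, if the policy has not changed at all, the same-action ratio is exactly 1, while the `max` variant returns a ratio different from 1 whenever the sampled action was not the highest-scoring one. That mismatch is the concern raised above.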