[question] Using keras in Custom Policy #220
Hello, see below for minimal code to reproduce (I got reward > 100):

import tensorflow as tf

from stable_baselines import PPO2
from stable_baselines.common.policies import ActorCriticPolicy


class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self._value, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self._value, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self._value, {self.obs_ph: obs})


model = PPO2(KerasPolicy, "CartPole-v1", verbose=1)
model.learn(25000)

env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print("Reward: ", reward_sum)
        reward_sum = 0.0
        obs = env.reset()
env.close()

I'm using tf-gpu (1.8.0) and the latest version of stable-baselines (2.5.0a0). |
Hey, after trying the code, I am getting the same problem. It seems that under TF 1.12.0 Keras is ignoring the reuse flag of the variable scope. There isn't much of a fix unfortunately, as Keras seems to be creating its weights with tf.Variable rather than tf.get_variable. |
@araffin, @hill-a thank you very much for looking into this! This problem has been haunting me for a while. I think the best-case scenario as a band-aid is to downgrade to TF 1.8.0. The difference between tf.get_variable vs tf.Variable is very unfortunate... Do you have an intuition as to how stable-baselines might change as years go on, given that TF 2.0 is placing heavy bets on Keras as the future-facing way of doing things? |
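For readers less familiar with that distinction, here is a minimal sketch (TF 1.x API; the helper name and shapes are made up for illustration) of why sharing via tf.get_variable works under a reused variable scope, while Keras layer instances built in that same scope can still end up with independent weights:

import tensorflow as tf  # TF 1.x

def make_weight():
    # tf.get_variable goes through the variable store, so with AUTO_REUSE the
    # same underlying variable is returned on every call.
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        return tf.get_variable("w", shape=[4, 64])

w1 = make_weight()
w2 = make_weight()
print(w1 is w2)  # True: one shared variable

# Keras layer objects create their own weights; on TF versions where this
# bypasses get_variable (1.12+ as reported above), the scope's reuse flag
# is ignored and each layer instance gets a fresh kernel.
obs = tf.placeholder(tf.float32, [None, 4])
with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
    layer_a = tf.keras.layers.Dense(64, name="fc")
    layer_b = tf.keras.layers.Dense(64, name="fc")
    layer_a(obs)
    layer_b(obs)
print(layer_a.kernel is layer_b.kernel)  # False when reuse is ignored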
If TF 2.0 were to be Keras-like, in my opinion the fix would be to have policies where the layers are created up front, and the observation is then passed through them in a function like this:

from stable_baselines.common.policies import ActorCriticPolicy, nature_cnn


class CustomPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(CustomPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=True)
        self._build_kwargs = kwargs

        with tf.variable_scope("model", reuse=self.reuse):
            activ = tf.nn.relu

            # the feature extractor is stored as a callable; the observation
            # is only fed through it in build()
            self.extracted_features = lambda obs: nature_cnn(obs, **self._build_kwargs)

            self.pi_layers = []
            for i, layer_size in enumerate([128, 128, 128]):
                self.pi_layers.append(tf.layers.Dense(layer_size, activation=activ, name='pi_fc' + str(i)))

            self.vf_layers = []
            for i, layer_size in enumerate([32, 32]):
                self.vf_layers.append(tf.layers.Dense(layer_size, activation=activ, name='vf_fc' + str(i)))

            # keep the layer object separate from its output tensor so that
            # build() can be called for both the act and train graphs
            self.value_layer = tf.layers.Dense(1, name='vf')

        self._setup_init()

    def build(self, obs):
        with tf.variable_scope("model", reuse=self.reuse):
            pi_h = vf_h = self.extracted_features(obs)

            for layer in self.pi_layers:
                pi_h = layer(pi_h)
            pi_latent = pi_h

            for layer in self.vf_layers:
                vf_h = layer(vf_h)
            value_fn = self.value_layer(vf_h)
            vf_latent = vf_h

            self.proba_distribution, self.policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self.value_fn = value_fn
        self.initial_state = None

of course, with quite a bit of the backend to change (the init functions of the base policies, and how the models build the policies) |
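The underlying idea of that sketch is that weight sharing between the act and train graphs would come from reusing the same Python layer objects, rather than from tf.get_variable and variable_scope(reuse=True). A tiny illustration of that sharing style (placeholder shapes are arbitrary):

import tensorflow as tf

# Calling the *same* layer object on two different inputs builds the weights
# once and reuses them, so no variable_scope reuse is needed.
shared_fc = tf.keras.layers.Dense(64, activation="tanh", name="shared_fc")

act_obs = tf.placeholder(tf.float32, [None, 4])
train_obs = tf.placeholder(tf.float32, [None, 4])

act_latent = shared_fc(act_obs)      # creates kernel and bias
train_latent = shared_fc(train_obs)  # reuses the same kernel and bias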
Are there any further plans regarding this? Now that we know TF 2.0 is going to drop tf.variable_scope and even handle sessions differently, will everything pretty much have to be rewritten? |
When I test the code from @araffin using tensorflow-gpu 1.8 and the latest pip install of stable-baselines on Ubuntu 16.04, I get the following error:
|
I would like to add my vote here as well. Will this get fixed at some point, or will we have to wait for the TF 2.0-compatible version? Not being able to use predefined Keras layers means that a ton of really useful model and layer libraries are unusable with stable-baselines, and that model code will be less future-proof and much more difficult to read and maintain. This is a very unfortunate limitation to an otherwise really nice deep RL library. |
I made some changes to the code, as shown below, and it seems to be working on stable-baselines (2.9.0) with tf-gpu==1.14.x:

import tensorflow as tf

from stable_baselines import PPO2
from stable_baselines.common.policies import ActorCriticPolicy


class KerasPolicy(ActorCriticPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=False, **kwargs):
        super(KerasPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse=reuse, scale=False)

        with tf.variable_scope("model", reuse=reuse):
            flat = tf.keras.layers.Flatten()(self.processed_obs)

            x = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_0')(flat)
            pi_latent = tf.keras.layers.Dense(64, activation="tanh", name='pi_fc_1')(x)

            x1 = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_0')(flat)
            vf_latent = tf.keras.layers.Dense(64, activation="tanh", name='vf_fc_1')(x1)

            value_fn = tf.keras.layers.Dense(1, name='vf')(vf_latent)

            self._proba_distribution, self._policy, self.q_value = \
                self.pdtype.proba_distribution_from_latent(pi_latent, vf_latent, init_scale=0.01)

        self._value_fn = value_fn
        self._setup_init()

    def step(self, obs, state=None, mask=None, deterministic=False):
        if deterministic:
            action, value, neglogp = self.sess.run([self.deterministic_action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        else:
            action, value, neglogp = self.sess.run([self.action, self.value_flat, self.neglogp],
                                                   {self.obs_ph: obs})
        return action, value, self.initial_state, neglogp

    def proba_step(self, obs, state=None, mask=None):
        return self.sess.run(self.policy_proba, {self.obs_ph: obs})

    def value(self, obs, state=None, mask=None):
        return self.sess.run(self.value_flat, {self.obs_ph: obs})


model = PPO2(KerasPolicy, "CartPole-v1", verbose=1, tensorboard_log='./log')
model.learn(25000)

env = model.get_env()
obs = env.reset()

reward_sum = 0.0
for _ in range(1000):
    action, _ = model.predict(obs)
    obs, reward, done, _ = env.step(action)
    reward_sum += reward
    env.render()
    if done:
        print("Reward: ", reward_sum)
        reward_sum = 0.0
        obs = env.reset()

env.close()
|
Running Ubuntu 18.04.2 LTS, Docker 19.03.6 with the tensorflow/tensorflow:1.14.0-gpu-py3-jupyter image and stable_baselines 2.10.0: FWIW, I cannot get a PPO2 agent to learn CartPole using this KerasPolicy as-is, whereas when I use the default MlpPolicy, training occurs fine. Discounted reward chart shown here: [chart image]. @AvisekNaug, using your code above, I would have expected like-for-like behaviour with the default MlpPolicy, since both use two dense layers of 64 units. Are you able to get training to occur successfully? |
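For reference, the stock baseline mentioned above can be reproduced with something like this (default PPO2 hyperparameters; the default MlpPolicy is two 64-unit tanh layers, i.e. the same architecture as the KerasPolicy):

from stable_baselines import PPO2

# Default MlpPolicy on CartPole-v1, same timestep budget as the snippets above.
baseline = PPO2("MlpPolicy", "CartPole-v1", verbose=1)
baseline.learn(25000)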
Yeah, it does not train, for the reasons discussed by @hill-a: it is an issue with Keras, where the reuse flag of the model scope does not seem to work as intended. See his response above. I merely tried to answer @pirobot's issue for stable-baselines 2.10. But yeah, it does not train properly with Keras layers. |
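One way to check whether the act and train models really did end up with separate Keras weights is to list the trainable variables of the model built above and look for duplicated, suffixed copies of the layer weights (a sketch, assuming the stable-baselines model exposes its graph as model.graph; the exact duplicated names depend on how TF uniquifies them):

import tensorflow as tf

# If the reuse flag were honoured, each weight (e.g. model/pi_fc_0/kernel)
# would appear exactly once; suffixed duplicates such as model/pi_fc_0_1/kernel
# suggest the act and train models each created their own Keras weights.
with model.graph.as_default():
    for var in tf.trainable_variables():
        print(var.name, var.shape)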
I am trying to use Keras to define my own custom policy; unfortunately, after several hours of trying I couldn't get it to train on CartPole.
Here is the CustomPolicy example I have modified to work with CartPole, and this trains properly. Here is the Keras version of my implementation that runs, but does NOT train (tf.keras.layers vs keras.layers doesn't make a difference).
I tried to ensure both implementations are as close to each other as possible. Any help at this point would be greatly appreciated.
Thank you in advance
Keras version: 2.2.2
Tensorflow version: 1.12.0
Stable Baselines version: 2.4.0a
Attached is the minimal code to reproduce the current issue, with TensorBoard graphs for comparison.
custom_model.py.zip