Trained on:
- Intel Xeon CPU E5-1650 v3 @ 3.50GHz (12 threads)
- 1x NVIDIA RTX 2080 Ti
- 30 GB RAM
Time per episode: ~3.5 s
Thus, the time for 150,000 episodes comes to roughly 150,000 × 3.5 s ≈ 525,000 s ≈ 6.07 days.
For the models in prio/ we used a prioritized replay buffer; otherwise we used a replay buffer with uniform random sampling.
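For context, the uniform buffer can be sketched roughly as below (the class and method names are illustrative, not the ones in this repository); the prioritized variant additionally stores a priority per transition (e.g. based on the TD error) and samples transitions proportionally to it:

```python
import random
from collections import deque

class UniformReplayBuffer:
    """Fixed-size buffer from which transitions are sampled uniformly at random."""

    def __init__(self, capacity):
        # deque drops the oldest transitions automatically once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # every stored transition is equally likely to be drawn
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```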
All hyperparameters used can be found in our config. Unfortunately, since training takes this long, we were not able to try a wide variety of hyperparameters.
We used 4 different seeds (for env and numpy) for training and evaluation, namely
- 42
- 366
- 533
- 1337
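For reproducibility, the seeding roughly follows this pattern (a sketch using the older gym seeding API; the environment id and helper name are placeholders, not the actual code):

```python
import numpy as np
import gym

def seed_everything(env, seed):
    """Seed numpy and the environment so training/evaluation runs are reproducible."""
    np.random.seed(seed)
    env.seed(seed)               # environment dynamics / track generation (older gym API)
    env.action_space.seed(seed)  # random actions drawn during epsilon-greedy exploration

for seed in (42, 366, 533, 1337):
    env = gym.make("CarRacing-v0")  # environment id is a placeholder
    seed_everything(env, seed)
    # ... train and evaluate with this seed ...
```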
To click through our data yourself, you can run
$ cd /home/dentarthur/rl-project
$ tensorboard --logdir models/tensorboard
Note: Over time, epsilon decays. The car will therefore drive faster and move forward with a higher probability.
This plot shows the reward gained over 10 episodes, averaged over all 4 seeds. As you can see, the performance is hardly influenced by the seed or the buffer implementation.
As the plots show, the agent has not learned a policy that solves the environment. To be more precise, it has learned to drive straight ahead and hope that most of the area in front of it is road, thus maximizing its reward before hitting the void.
We think this is due to the small likelihood of ever successfully exploring how to drive through a curve when using epsilon-greedy exploration.
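To make this concrete, the sketch below shows generic epsilon-greedy action selection with a linearly decaying epsilon (the parameter values are illustrative, not taken from our config). Driving a curve requires a long, specific sequence of steering actions, each of which is only explored with probability on the order of epsilon divided by the number of actions:

```python
import numpy as np

rng = np.random.default_rng()

def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay_steps=100_000):
    """Linear decay from eps_start to eps_min over decay_steps environment steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_min - eps_start)

def select_action(q_values, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # random (exploratory) action
    return int(np.argmax(q_values))          # greedy action w.r.t. current Q-estimates
```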
Normal agent (left) and agent with 0.5 eps (right):
With more time (and faster training) we would try to:
- implement a Noisy DQN for a different exploration strategy (see the sketch below)
- shape the reward function to punish driving on grass for too long
- increase downsampling to feed the network less input information
- remove going straight from the action space (only allowing gas + direction)
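As an illustration of the first item, a NoisyNet-style layer (Fortunato et al., 2017) replaces epsilon-greedy exploration by adding learnable, factorised Gaussian noise to the weights of the Q-network's fully connected layers. This is only a sketch in PyTorch, not code from this repository:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights and biases are perturbed by learnable, factorised
    Gaussian noise; exploration then comes from the network itself instead of epsilon."""

    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.bias_mu = nn.Parameter(torch.empty(out_features))
        self.bias_sigma = nn.Parameter(torch.empty(out_features))
        self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
        self.register_buffer("bias_eps", torch.zeros(out_features))
        self.sigma_init = sigma_init
        self.reset_parameters()
        self.reset_noise()

    def reset_parameters(self):
        bound = 1.0 / math.sqrt(self.in_features)
        self.weight_mu.data.uniform_(-bound, bound)
        self.bias_mu.data.uniform_(-bound, bound)
        self.weight_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))
        self.bias_sigma.data.fill_(self.sigma_init / math.sqrt(self.in_features))

    @staticmethod
    def _scaled_noise(size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()

    def reset_noise(self):
        # factorised noise: one vector per input and output dim, combined by outer product
        eps_in = self._scaled_noise(self.in_features)
        eps_out = self._scaled_noise(self.out_features)
        self.weight_eps.copy_(torch.outer(eps_out, eps_in))
        self.bias_eps.copy_(eps_out)

    def forward(self, x):
        if self.training:
            weight = self.weight_mu + self.weight_sigma * self.weight_eps
            bias = self.bias_mu + self.bias_sigma * self.bias_eps
        else:  # act on the mean weights at evaluation time
            weight, bias = self.weight_mu, self.bias_mu
        return F.linear(x, weight, bias)
```

In such a setup, the final fully connected layers of the Q-network would be replaced by NoisyLinear layers, epsilon-greedy action selection would be dropped, and reset_noise() would be called once per training step so that each batch sees a fresh noise sample.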