Adastop user guide #444

Merged
merged 8 commits into from
Apr 9, 2024
83 changes: 83 additions & 0 deletions docs/basics/userguide/adastop.md
@@ -0,0 +1,83 @@
(adastop_userguide)=


# AdaStop



## Hypothesis testing to compare RL agents

AdaStop is a sequential testing procedure for efficient and reliable comparison of stochastic algorithms, first introduced in <https://arxiv.org/abs/2306.10882>.

This section explains how to use the AdaStop algorithm in rlberry. AdaStop implements a sequential statistical test using group sequential permutation tests and is especially suited to multiple testing with very small sample sizes. The main AdaStop library is at <https://github.com/TimotheeMathieu/adastop>, but for comparing RL agents it is easier to use the rlberry bindings.
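To make the underlying test concrete, here is a minimal, self-contained sketch of a plain two-sample permutation test, the primitive that AdaStop applies in a group-sequential fashion. Everything below (score values, batch size, number of permutations) is illustrative and is not part of the rlberry or adastop APIs.

```python
import numpy as np

# Hypothetical evaluation scores for two agents, 5 training runs each.
rng = np.random.default_rng(0)
scores_a = rng.normal(300.0, 30.0, size=5)
scores_b = rng.normal(330.0, 30.0, size=5)

# Test statistic: absolute difference between the mean scores.
observed = abs(scores_a.mean() - scores_b.mean())

# Under the null hypothesis the two agents are exchangeable, so we shuffle
# the pooled scores and recompute the statistic many times.
pooled = np.concatenate([scores_a, scores_b])
n_perm = 10_000
exceed = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    exceed += abs(perm[:5].mean() - perm[5:].mean()) >= observed

# Permutation p-value with the usual +1 correction.
p_value = (1 + exceed) / (1 + n_perm)
print(f"permutation p-value: {p_value:.4f}")
```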

We use AdaStop in particular to adaptively choose the number of trainings necessary to reach a statistically significant decision when comparing algorithms. The rationale is that when the outcome of an experiment in computer science is stochastic, the same experiment must be run several times in order to obtain a viable comparison of the algorithms and to rank them with a theoretically controlled family-wise error rate. AdaStop chooses the number of repetitions adaptively, stopping data collection as soon as possible. Note that what we call an algorithm here is really a particular implementation of an algorithm.
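The adaptive-stopping idea can be sketched in a few lines: train a small batch of agents, run an interim test, and stop as soon as a decision can be made. The boundary used below is deliberately naive and only for illustration; the actual AdaStop procedure calibrates its group-sequential boundaries so that the family-wise error rate stays controlled.

```python
import numpy as np

rng = np.random.default_rng(1)

def new_batch_of_scores(mean, n=5):
    """Stand-in for training an agent n more times and evaluating it."""
    return rng.normal(mean, 30.0, size=n)

scores_a = np.empty(0)
scores_b = np.empty(0)
for interim in range(1, 6):  # at most 5 batches, i.e. 25 trainings per agent
    scores_a = np.concatenate([scores_a, new_batch_of_scores(300.0)])
    scores_b = np.concatenate([scores_b, new_batch_of_scores(330.0)])
    gap = abs(scores_a.mean() - scores_b.mean())
    spread = np.concatenate([scores_a, scores_b]).std()
    if gap > 2.0 * spread:  # naive decision boundary, NOT AdaStop's
        print(f"stopped after batch {interim}: the agents look different")
        break
else:
    print("budget exhausted without a significant difference")
```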



## Comparison of A2C and PPO from stable-baselines3

Below, we compare the A2C and PPO agents from stable-baselines3 on the CartPole environment. We limit the maximum number of trainings for each agent to $5\times 5 = 25$, using at most $5$ batches of $5$ trainings each. We require the resulting test to have a level of $99\%$ (i.e., the probability of wrongly declaring the agents different is at most $1\%$).

```python
from rlberry.envs import gym_make
from stable_baselines3 import A2C, PPO
from rlberry.agents.stable_baselines import StableBaselinesAgent
from rlberry.manager import AdastopComparator

env_ctor, env_kwargs = gym_make, dict(id="CartPole-v1")

# One configuration dictionary per agent to compare.
managers = [
    {
        "agent_class": StableBaselinesAgent,
        "train_env": (env_ctor, env_kwargs),
        "fit_budget": 5e4,
        "agent_name": "A2C",
        "init_kwargs": {"algo_cls": A2C, "policy": "MlpPolicy", "verbose": 1},
    },
    {
        "agent_class": StableBaselinesAgent,
        "train_env": (env_ctor, env_kwargs),
        "agent_name": "PPO",
        "fit_budget": 5e4,
        "init_kwargs": {"algo_cls": PPO, "policy": "MlpPolicy", "verbose": 1},
    },
]

# n: batch size, K: maximum number of batches, alpha: significance level.
comparator = AdastopComparator(n=5, K=5, alpha=0.01)
comparator.compare(managers)
print(comparator.managers_paths)
```
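The comparison trains the agents batch by batch and saves the trained experiment managers to disk. As a hedged sketch (the exact structure of `managers_paths` may differ across rlberry versions; here we assume it maps agent names to saved-manager paths), one of them could be reloaded for further inspection:

```python
from rlberry.manager import ExperimentManager

# Assumption: managers_paths maps each agent name to the path of its
# saved ExperimentManager; adjust to what the printed dict shows.
manager_a2c = ExperimentManager.load(comparator.managers_paths["A2C"])
```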

## Result visualisation

The results of the comparison can be obtained in text format using `print_results`:

```python
comparator.print_results()
```

In this run, the test reached a decision after gathering 10 scores (two batches of 5) for each agent:
```
Number of scores used for each agent:
A2C:10
PPO:10

Mean of scores of each agent:
A2C:271.17600000000004
PPO:500.0

Decision for each comparison:
A2C vs PPO:smaller
```


Here, the decision `smaller` means that the test concluded that the mean score of A2C is smaller than the mean score of PPO.

The results can also be visualised with a plot using `plot_results`:

```python
comparator.plot_results()
```

![](adastop_boxplots.png)

The boxplots represent the distribution of the scores gathered for each agent. The table at the top of the figure shows the decision taken by the test for each comparison: larger, smaller or equal.
Binary file added docs/basics/userguide/adastop_boxplots.png
8 changes: 3 additions & 5 deletions docs/index.md
@@ -49,14 +49,12 @@ It could be useful in many way :
See the [Save and Load Experiment](save_load_page) page.

### Statistical comparison of RL agents

The principal goal of rlberry is to give tools for proper experimentation in RL. In research, one of the usual tasks is to compare two or more RL agents, and for this one typically uses several seeds to train the agents several times and compares the resulting mean rewards. We show here how to make sure that enough data was acquired to assert that two RL agents are indeed different. We propose two ways to do that: first, classical hypothesis testing, and second, a sequential testing scheme with AdaStop that aims at saving computation by stopping early when possible.
#### Compare agents
Compare several trained agents using the mean over a specify number of evaluations for each agent.
TODO : to complete
We give tools to compare several trained agents using the mean over a specified number of evaluations for each agent. The explanation can be found in the [user guide](comparison_page).

#### AdaStop
TODO : Text

AdaStop is a sequential testing procedure for efficient and reliable comparison of stochastic algorithms. It has been successfully used to compare RL agents efficiently; an example of such use can be found in the [user guide](adastop_userguide).

[linked paper](https://hal-lara.archives-ouvertes.fr/hal-04132861/)

3 changes: 2 additions & 1 deletion docs/user_guide.md
@@ -45,4 +45,5 @@ You can find more details about installation [here](installation)!
- Custom Environments (In construction)
- [Using external libraries](external) (like [Stable Baselines](stable_baselines) and [Gymnasium](Gymnasium_ancor))
- Transfer Learning (In construction)
- AdaStop(In construction)
- [Hypothesis testing for comparison of RL agents](comparison_page)
- [Adaptive hypothesis testing for comparison of RL agents with AdaStop](adastop_userguide)
22 changes: 22 additions & 0 deletions rlberry/manager/comparison.py
@@ -105,6 +105,28 @@ def compare(self, manager_list, n_evaluations=50, verbose=True):
        logger.info("Results are ")
        print(self.get_results())

    def print_results(self):
        """
        Print the results of the test.
        """
        print("Number of scores used for each agent:")
        for key in self.n_iters:
            print(key + ":" + str(self.n_iters[key]))

        print("")
        print("Mean of scores of each agent:")
        for key in self.eval_values:
            print(key + ":" + str(np.mean(self.eval_values[key])))

        print("")
        print("Decision for each comparison:")
        for c in self.comparisons:
            print(
                "{0} vs {1}".format(self.agent_names[c[0]], self.agent_names[c[1]])
                + ":"
                + str(self.decisions[str(c)])
            )

    def _fit_evaluate(self, managers, eval_values, seeders):
        """
        fit rlberry agents.
1 change: 1 addition & 0 deletions rlberry/manager/tests/test_comparisons.py
@@ -139,5 +139,6 @@ def test_adastop():

    comparator = AdastopComparator(seed=42)
    comparator.compare(managers)
    comparator.print_results()
    assert comparator.is_finished
    assert not ("equal" in comparator.decisions.values())