A toolbox with the goal of speeding up research on bargaining in MARL (cooperation problems in MARL).

longtermrisk/marltoolbox

marltoolbox: Facilitate and speed up the research on bargaining in MARL.

Overview

Major features of this toolbox:
This toolbox contains algorithms, environments, evaluation tools, and helper functions to conduct research on bargaining in MARL.

This toolbox relies on the Ray/Tune/RLLib framework to provide the basic RL components and research functionalities.

Additional benefits of relying on the Ray/Tune/RLLib research framework:

  • using components from RLLib with extensive configuration available (e.g. using a PPO policy or a priority replay buffer)
  • track your experiments, log easily in TensorBoard, run hyperparameter search
  • be agnostic to the deep learning framework
  • create new algorithms using the very simple Tune API or the RLLib API
  • use the RLLib API to take advantage of a fully customizable training pipeline
  • create distributed algorithms (e.g. by using the policy factory of RLLib)

Philosophy: Implement when needed. Improve at each new use. Keep it simple. Keep it flexible. Keep the maintenance cost low.

Support: We actively support researchers by adding tools that they consider relevant for research on bargaining in MARL.

Get started

How to use this toolbox

Introduction

marltoolbox is a toolbox that you should fork or clone and customize for yourself. You can create new experiments by starting from the existing examples, and you should edit or inherit from any functionality that doesn't exactly fit your needs. This repository is intended as a toolbox that can be shared within a research team. It is not intended to be used in production.
marltoolbox is not a framework that provides a simple API to run experiments in a few lines of code (this is a feature of RLLib).

RLLib is built on top of Tune, and Tune is built on top of Ray. marltoolbox is built to work with RLLib, but it also lets you fall back to Tune alone if needed, at the cost of some functionality.

To speed up research, we advise taking advantage of the functionalities of Tune and RLLib.

a) Read the README of the Ray project (which includes Tune and RLLib):

Ray README (<5 min)

b) Read this quick introduction to Tune

Tune's key concepts (< 5 min)

c) Read this quick introduction to RLLib

RLlib in 60 seconds (< 5 min)

d) Introduction to this toolbox:

Without any local installation, you can work through two tutorials that introduce marltoolbox together with Tune and RLLib.
Please use Google Colab to run them:

Advanced introduction

To explore Tune further:

To explore RLLib further:

To explore the toolbox marltoolbox further, take a look at our examples.

Toolbox installation

The installation is tested with Ubuntu 18.04 LTS (preferred) and 20.04 LTS.
It requires less than 20 GB of disk space, including all dependencies such as PyTorch.

(Optional) Connect to your virtual machine(VM) on Google Cloud Platform(GCP)
gcloud compute ssh {replace-by-instance-name}
(Usually optional) Upgrade the system and install some basic requirements (e.g. needed on a new VM)
sudo apt update
sudo apt upgrade
sudo apt-get install build-essential
# Run this command another time (especially needed with Ubuntu 20.04 LTS)
sudo apt-get install build-essential
(Optional) Use a virtual environment
# If needed, install conda:
## Follow the instructions at
## https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
## For example:
	wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
	bash Miniconda3-latest-Linux-x86_64.sh
	# Answer the installer prompts: Enter. Enter... yes. Enter. yes.
	exit
	# Connect again to the VM or open a new terminal
	gcloud compute ssh {replace-by-instance-name}
	# Check your conda installation
	conda list

# Create a virtual environment:
conda create -y -n marltoolbox python=3.8.5
conda activate marltoolbox
pip install --upgrade pip

Install the toolbox: marltoolbox

## Install dependencies
### For RLLib
conda install -y psutil
### (optional) To be able to use most of the gym environments
sudo apt-get install -y libglu1-mesa-dev libgl1-mesa-dev libosmesa6-dev xvfb ffmpeg curl patchelf libglfw3 libglfw3-dev cmake zlib1g zlib1g-dev swig

## Install marltoolbox
git clone https://github.com/longtermrisk/marltoolbox.git
cd marltoolbox

## Here are different installation instructions to support different algorithms
### Default install
pip install -e .
### If you plan to use LOLA, run this instead:
conda install -y python=3.6
pip install -e .[lola]

Test the installation

# Check that RLLib is working
## Use RLLib built-in training functionalities
rllib train --run=PPO --env=CartPole-v0 --torch 
## Ctrl+C to stop the training 

# Check that the toolbox is working
python ./marltoolbox/examples/rllib_api/pg_ipd.py
## You should get the status TERMINATED

# Visualize the logs
tensorboard --logdir ~/ray_results
## If working on GCP: forward the connection from a Virtual Machine(VM) to your machine
## Run this command on your local machine from another terminal (not in the VM)
gcloud compute ssh {replace-by-instance-name} -- -NfL 6006:localhost:6006
## Then open http://localhost:6006/ in your browser
(Optional) Install additional deep learning libraries (PyTorch CPU only is installed by default)
# Install PyTorch with GPU
# Check cuda version
nvidia-smi
# Look for "CUDA Version: XX.X"
# With the right cuda version:
conda install pytorch torchvision cudatoolkit=[cuda version like 10.2] -c pytorch
# Check PyTorch installation and if your GPU is available to PyTorch
python
    import torch
    torch.__version__
    torch.cuda.is_available()
    exit()

# Install Tensorflow
pip install tensorflow
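
The interactive check above can also be kept as a small script. The following is a minimal sketch, not part of the toolbox: the function names `find_installed` and `report` are hypothetical, and the list of libraries is simply the ones mentioned in these installation steps. It only relies on the standard library to detect what is importable.

```python
import importlib.util

# Libraries mentioned in the installation steps above.
LIBRARIES = ("ray", "torch", "tensorflow")


def find_installed(names=LIBRARIES):
    """Map each top-level library name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def report(names=LIBRARIES):
    """Print one status line per library, plus GPU visibility for PyTorch."""
    status = find_installed(names)
    for name, installed in status.items():
        print(f"{name}: {'installed' if installed else 'missing'}")
    if status.get("torch"):
        # Imported lazily so the script also runs without PyTorch.
        import torch

        print("torch version:", torch.__version__)
        print("CUDA available:", torch.cuda.is_available())
    return status


if __name__ == "__main__":
    report()
```

Run it inside the activated conda environment so that it inspects the same interpreter that the toolbox uses.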

Training models

Probably the greatest value of using RLLib/Tune and this toolbox is that you can use the provided environments, policies, and other components of marltoolbox and RLLib (like a PPO agent) anywhere, e.g. without using Tune or RLLib for anything else.

That said, we recommend using Tune and, if possible, RLLib. There are three main ways to run experiments with Tune or RLLib. From the first to the last, they offer more functionality but impose more constrained APIs.

Tune function API (the least constrained, not recommended)
  • Constraints: With the Tune function API, you only need to provide the training function. See the Tune documentation.
  • Best used: If you want to very quickly run some code from an external repository.
  • Functionalities: Run several seeds in parallel and compare their results. Easily plot values to TensorBoard and visualize the plots live. Track your experiments and hyperparameters. Hyperparameter search. Early stopping.
Tune class API (very few constraints, recommended)
  • Constraints: You need to provide a Trainer class with at minimum a setup method and a step method. See the Tune documentation.
  • Best used: If you want to run some code from an external repository and you need checkpoints. Helpers in this toolbox (marltoolbox.utils.policy.get_tune_policy_class) will also allow you to transform this class (once trained) into frozen RLLib policies. This is useful to evaluate it against other RLLib algorithms or when using experimentation tools from marltoolbox.utils.
  • Additional functionalities: Cleaner format. Checkpoints. Conversion to the RLLib policy API.
    The trained agents can be converted to the RLLib policy API for evaluation only. This allows you to use functionalities that rely on the RLLib API (except training).
RLLib API (quite constrained, recommended)
  • Constraints: You need to use the RLLib API (trainer, policy, callbacks, etc.). Note that RLLib trainer classes are specific implementations of the Tune class API (just above). See the RLLib documentation.
  • Best used: If you are creating a new training setup or policy from scratch. Or if you want a seamless integration with all RLLib components. Or if you need distributed training.
  • Additional functionalities: Easily use all components from RLLib (models, environments, algorithms, exploration, schedulers, preprocessing, etc.). Use the customizable trainer and policy factories from RLLib.
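
To make the constraints of the Tune class API concrete, here is a minimal sketch of a class with a setup method, a step method, and checkpoint methods. It is a plain-Python stand-in and does not import Ray: `CounterTrainer` and the `run` driver are hypothetical names, and the method names are assumed to mirror those of `ray.tune.Trainable`, which a real trainer class would inherit from.

```python
import json
import os
import tempfile


class CounterTrainer:
    """Plain-Python stand-in for a class following the Tune class API.

    Assumption: in a real experiment this class would inherit from
    ray.tune.Trainable. It is kept dependency-free here.
    """

    def setup(self, config):
        # Called once: build the model, optimizer, environment, etc.
        self.lr = config["lr"]
        self.weight = 0.0
        self.iteration = 0

    def step(self):
        # One training iteration: move `weight` toward the target 1.0
        # and return the metrics to log.
        self.weight += self.lr * (1.0 - self.weight)
        self.iteration += 1
        return {"iteration": self.iteration, "loss": (1.0 - self.weight) ** 2}

    def save_checkpoint(self, checkpoint_dir):
        # Write the full training state to a file and return its path.
        path = os.path.join(checkpoint_dir, "state.json")
        with open(path, "w") as f:
            json.dump({"weight": self.weight, "iteration": self.iteration}, f)
        return path

    def load_checkpoint(self, path):
        # Restore the training state written by save_checkpoint.
        with open(path) as f:
            state = json.load(f)
        self.weight = state["weight"]
        self.iteration = state["iteration"]


def run(trainer, config, n_steps):
    """Hypothetical driver that plays the role Tune plays in a real run."""
    trainer.setup(config)
    return [trainer.step() for _ in range(n_steps)]


trainer = CounterTrainer()
metrics = run(trainer, {"lr": 0.5}, n_steps=3)
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = trainer.save_checkpoint(tmp_dir)
    restored = CounterTrainer()
    restored.setup({"lr": 0.5})
    restored.load_checkpoint(checkpoint_path)
print(metrics[-1], restored.weight)
```

The point of the split is that the runner, not your code, owns the outer loop: it decides when to call step, when to checkpoint, and when to stop.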

Some usages

Fall back to the Tune APIs when using the RLLib API is too costly

If the setup you want to train already exists and has a training loop, and if converting it to RLLib would be too costly, then you can use Tune with minimal changes.

When is the conversion cost to RLLib too high?

  • If the algorithm has a complex unusual dataflow
  • If the algorithm has an unusual training process
    • like LOLA: performing "virtual" opponent updates
    • like LTFT: nested algorithms
  • If you don't need to change the algorithm
  • If you don't plan to run the algorithm against policies from RLLib
  • If you do not plan to work much with the algorithm and thus do not want to invest time in the conversion to RLLib
  • If some of the points above apply and you are only starting to use RLLib
  • etc.
Tutorials:
  • Tutorial_Basics_How_to_use_the_toolbox.ipynb
Examples:

You can find such examples in marltoolbox.examples.tune_class_api and in marltoolbox.examples.tune_function_api.
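
As a rough illustration of the "minimal changes" involved, the sketch below wraps an existing training loop in a function that takes a config and reports metrics, which is the shape the Tune function API expects. All names here (`external_training_loop`, `make_trainable`) are hypothetical and nothing is imported from Ray. The assumption is that, with the Ray 1.x releases, the `report` callable would be `ray.tune.report`.

```python
def external_training_loop(n_epochs, lr):
    """Hypothetical stand-in for a training loop from an external repository.

    Yields one metrics dict per epoch instead of only printing them.
    """
    loss = 1.0
    for epoch in range(n_epochs):
        loss *= 1.0 - lr  # placeholder for a real optimization step
        yield {"epoch": epoch, "loss": loss}


def make_trainable(report):
    """Build a function with the `config -> None` shape of the Tune function API.

    `report` is whatever callable forwards metrics to the experiment runner.
    Assumption: with Ray 1.x this would be `ray.tune.report`, called as
    `report(**metrics)`.
    """

    def trainable(config):
        for metrics in external_training_loop(config["n_epochs"], config["lr"]):
            report(**metrics)

    return trainable


# Local dry run without Ray: collect the reported metrics in a list.
collected = []
trainable = make_trainable(lambda **metrics: collected.append(metrics))
trainable({"n_epochs": 3, "lr": 0.5})
print(collected[-1])
```

Only the reporting call and the config dictionary are Tune-specific. The external loop itself stays untouched, which is why this route is cheap compared to a full conversion to RLLib.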

Using components directly provided by RLLib or marltoolbox
Tutorials:
  • Tutorial_Basics_How_to_use_the_toolbox.ipynb
a) Examples using the Tune class API:
  • Using an A3C policy: amd.py with use_rllib_policy = True (toolbox example)
  • Using (custom or not) environments:
    • IPD and coin game environments: amd.py (toolbox example)
    • Asymmetric coin game environment: lola_pg_official.py (toolbox example)
b) Examples using the RLLib API:
Customizing existing algorithms from RLLib
Examples:
  • Customize policy's postprocessing (processing after env.step) and trainer: inequity_aversion.py (toolbox example)
  • Change the loss function of the Policy Gradient (PG) Policy: rock_paper_scissors_multiagent.py
    (in the run_with_custom_entropy_loss function)
Creating and using new custom policies in RLLib

In RLLib, customizing a policy lets you change its training and evaluation logic.

Examples:
Using custom dataflows in RLLib (custom Trainer or Trainer's execution_plan)
Examples:
  • Training 2 different policies with 2 different Trainers (less complex but less sample efficient than the 2nd method below): multi_agent_two_trainers.py
  • Training 2 different policies with a custom Trainer (more complex, more sample efficient): two_trainer_workflow.py
Using experimentation tools from the toolbox
Tutorials:
  • Evaluations_Level_1_best_response_and_self_play_and_cross_play.ipynb
Examples:
  • Training a level 1 best response: l1br_amtft.py (toolbox example)
  • Evaluating same-play and cross-play performances: amtft_various_env.py (toolbox example)

Main contents of the toolbox

Environments
Algorithms
Utils
  • exploration
    • SoftQ with temperature schedule
    • SoftQ with clustering of the Q values
  • log
    • callbacks to log values from environments and policies
  • lvl1_best_response
    • helper functions to train level 1 exploiters
  • policy
    • helper to transform a trained Tune Trainer into frozen RLLib policies
  • postprocessing
    • helpers to compute welfare functions and add this data in the evaluation batch (the batches sampled by the evaluation workers)
  • restore
    • helpers to load a checkpoint only for a chosen policy (instead of for all existing policies as RLLib does)
  • rollout
    • a rollout runner function which can be called from inside an RLLib policy
  • self_and_cross_perf
    • a helper to evaluate the performance in self-play and cross-play.
      "self-play": playing against agents from the same training run.
      "cross-play": playing against agents from different training runs.
  • plot
    • helpers to plot results
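
The self-play and cross-play definitions given above can be sketched as a pairing of training runs. This is an illustration only and not the toolbox's self_and_cross_perf helper: the function name `split_self_and_cross_play` is hypothetical, and treating any combination with at least two distinct runs as cross-play is an assumption that only matters with more than two players.

```python
from itertools import product


def split_self_and_cross_play(run_ids, n_players=2):
    """Enumerate which training run supplies the policy for each player.

    Self-play: every player comes from the same training run.
    Cross-play: at least two players come from different training runs
    (an assumption for games with more than two players).
    """
    self_play, cross_play = [], []
    for combo in product(run_ids, repeat=n_players):
        if len(set(combo)) == 1:
            self_play.append(combo)
        else:
            cross_play.append(combo)
    return self_play, cross_play


self_play, cross_play = split_self_and_cross_play(["seed_0", "seed_1", "seed_2"])
print(len(self_play), "self-play pairings")
print(len(cross_play), "cross-play pairings")
```

Ordered pairs are kept on purpose: in an asymmetric game, (seed_0, seed_1) and (seed_1, seed_0) assign the runs to different player roles.
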
Scripts
  • aggregate_and_plot_tensorboard_data
    • a script to aggregate the logged values from several seeds (into mean, std, etc.) and to create summary plots
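
The kind of aggregation this script performs can be sketched in a few lines. The example below is not the script itself: it works on in-memory lists rather than TensorBoard logs, the function name `aggregate_over_seeds` is hypothetical, and using the population standard deviation and truncating to the shortest curve are assumptions made for the illustration.

```python
from statistics import mean, pstdev


def aggregate_over_seeds(curves):
    """Aggregate per-seed metric curves step by step.

    `curves` maps a seed name to the list of values logged at each step.
    Curves are truncated to the shortest one so every step has all seeds.
    """
    n_steps = min(len(values) for values in curves.values())
    summary = []
    for step in range(n_steps):
        values = [curve[step] for curve in curves.values()]
        summary.append(
            {
                "step": step,
                "mean": mean(values),
                "std": pstdev(values),  # population standard deviation
                "min": min(values),
                "max": max(values),
            }
        )
    return summary


# Made-up values, for illustration only.
curves = {"seed_0": [1.0, 2.0, 3.0], "seed_1": [3.0, 4.0]}
for row in aggregate_over_seeds(curves):
    print(row)
```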

TODO and wishlist

Improvements
  • Add unit tests for the algorithms
  • Refactor the algorithms to make them more readable
  • Use the logger everywhere
  • Add and improve docstrings
  • Set good hyper-parameters in the custom examples
  • Report all results directly in Weights&Biases (saving download time from VM)
New algorithms
  • Multi-agent adversarial IRL
  • Multi-agent generative adversarial imitation learning
  • Model-based RL like PETS, MPC
  • Opponent modeling like k-level
  • Capability to use algorithms from OpenSpiel like MCTS
New functionalities
  • Reward uncertainty
  • Full / partial observability of opponent actions
  • (partial) Parameter transparency
  • Easy benchmarking with metrics specific to MARL
  • (more on) Exploitability evaluation
  • Performance against a suite of other MARL algorithms
New environments
  • Capability to use environments from OpenSpiel
  • (iterated) Ultimatum game (including variants)