Motivation

Reinforcement learning has attracted significant interest in recent years following the striking performances achieved in board games [Silver et al., 2018] and video games [Berner et al., 2019, Vinyals et al., 2019]. Solving these grand challenges constitutes an important milestone for the field. However, the corresponding agents require efficient simulators due to their high sample complexity [1]. Outside of games, many important applications, e.g., healthcare, can also be naturally formulated as reinforcement learning problems; however, simulators for such scenarios may not be available, reliable, or efficient.

The development of reinforcement learning methods that explore efficiently has long been considered one of the most crucial efforts to reduce sample complexity. Meticulously evaluating the strengths and weaknesses of such methods is essential to assess progress and to inspire new developments in the field. Such empirical evaluations must be performed using benchmarks composed of a selection of environments and evaluation criteria.

\(\texttt{Colosseum}\) was born from its authors' aspiration to develop a rigorous benchmarking methodology for reinforcement learning algorithms. We argue that environment selection should be based on theoretically principled reasoning that accounts for both the hardness of the environments and the soundness of the evaluation criteria.

In non-tabular reinforcement learning, there is no theory of hardness except in a few restricted settings. Consequently, the selection of environments in current benchmarks [Osband et al., 2020, Rajan et al., 2019] relies solely on the experience of their authors. Although such benchmarks are certainly valuable, there is no guarantee that they contain a sufficiently diverse range of environments or that they effectively quantify the agents' capabilities. In contrast, in tabular reinforcement learning, a rich theory of environment hardness is available. \(\texttt{Colosseum}\) leverages this theory to develop a principled benchmarking procedure. Accordingly, the environments are selected to maximize diversity with respect to two important measures of hardness, providing a varied set of challenges for which a precise characterization of hardness is available, and the evaluation criterion is the exact cumulative regret, which \(\texttt{Colosseum}\) computes efficiently.
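As a point of reference, and not as \(\texttt{Colosseum}\)'s exact definition (which is given in the accompanying paper), cumulative regret in the infinite-horizon average-reward setting is commonly defined as the gap between the reward an optimal policy would have accumulated and the reward the agent actually collected over \(T\) time steps,

\[
\mathrm{Regret}(T) = T\rho^{*} - \sum_{t=1}^{T} r_t,
\]

where \(\rho^{*}\) is the optimal average reward of the environment and \(r_t\) is the reward received by the agent at time step \(t\). Computing this quantity exactly requires full knowledge of the environment's dynamics, which is available in the tabular setting.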

Further details can be found in the accompanying paper.

[BBC+19] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, and others. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

[ODH+20] Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvári, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado van Hasselt. Behaviour suite for reinforcement learning. In International Conference on Learning Representations, 2020.

[RDG+19] Raghu Rajan, Jessica Lizeth Borja Diaz, Suresh Guttikonda, Fabio Ferreira, André Biedenkapp, Jan Ole von Hartz, and Frank Hutter. MDP playground: a design and debug testbed for reinforcement learning. arXiv preprint arXiv:1909.07750, 2019.

[SHS+18] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, and others. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 2018.

[VBC+19] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, and others. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.


[1] The number of observations that they require to optimize a reward-based criterion in an unknown environment.