The Colosseum Benchmark

The default \(\texttt{Colosseum}\) benchmark targets the four most widely studied settings of reinforcement learning: episodic ergodic, episodic communicating, continuous ergodic, and continuous communicating. Note that the continuous setting is also known as the infinite-horizon setting. The selection of environments is the result of the theoretical analysis and empirical validation presented in the paper.

The default benchmark environments

The lists below report the environments in the benchmark along with their parameters, grouped by environment family. We briefly describe the parameters that are common to all environments below, and we refer to the API of the environment classes for the meaning of the class-specific parameters.

The size parameter controls the number of states, the make_reward_stochastic parameter controls whether the rewards are stochastic, the p_lazy parameter is the probability that the MDP stays in the same state instead of executing the action selected by the agent, and the p_rand parameter is the probability that the MDP executes a random action instead of the action selected by the agent. A minimal instantiation example is sketched below.
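For concreteness, the snippet below instantiates a small episodic RiverSwim with these common parameters. It is a minimal sketch: the import path and the seed argument are assumptions about the \(\texttt{Colosseum}\) API and may differ across versions.

from colosseum.mdp.river_swim import RiverSwimEpisodic  # import path assumed

# A 5-state RiverSwim with stochastic rewards, a 10% chance that the MDP
# stays in place (p_lazy), and a 1% chance that a random action replaces
# the agent's choice (p_rand)
mdp = RiverSwimEpisodic(
    seed=42,  # Colosseum MDPs are assumed to be seeded for reproducibility
    size=5,
    make_reward_stochastic=True,
    p_lazy=0.1,
    p_rand=0.01,
)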

Episodic ergodic

  1. DeepSea(size=10, p_rand=0.4, make_reward_stochastic=False)

  2. DeepSea(size=13, p_rand=0.3, make_reward_stochastic=True)

  1. FrozenLake(size=4, make_reward_stochastic=True, p_lazy=0.03, p_frozen=0.98, p_rand=0.001)

  1. MiniGridEmpty(size=10, make_reward_stochastic=False, p_lazy=None, p_rand=0.05, n_starting_states=3)

  2. MiniGridEmpty(size=10, make_reward_stochastic=True, p_lazy=None, p_rand=0.3, n_starting_states=3)

  3. MiniGridEmpty(size=10, make_reward_stochastic=False, p_lazy=0.05, p_rand=0.2, n_starting_states=3)

  4. MiniGridEmpty(size=8, make_reward_stochastic=False, p_lazy=None, p_rand=0.3, n_starting_states=3)

  5. MiniGridEmpty(size=8, make_reward_stochastic=True, p_lazy=0.02, p_rand=0.4, n_starting_states=3)

  6. MiniGridEmpty(size=6, make_reward_stochastic=True, p_lazy=0.1, p_rand=0.05, n_starting_states=3)

  7. MiniGridEmpty(size=4, make_reward_stochastic=False, p_lazy=0.1, p_rand=0.4, n_starting_states=3)

  1. MiniGridRooms(make_reward_stochastic=True, room_size=3, n_rooms=4, p_rand=0.001, p_lazy=None)

  2. MiniGridRooms(make_reward_stochastic=True, room_size=3, n_rooms=9, p_rand=0.1, p_lazy=0.01)

  3. MiniGridRooms(make_reward_stochastic=True, room_size=4, n_rooms=4, p_rand=0.1, p_lazy=0.01)

  1. RiverSwim(make_reward_stochastic=True, size=5, p_lazy=0.1, p_rand=0.01, sub_optimal_distribution=("beta",(2.4, 24.0)), optimal_distribution=("beta",(0.01, 0.11)), other_distribution=("beta",(2.4, 249.0)))

  2. RiverSwim(make_reward_stochastic=True, size=30, p_lazy=0.01, p_rand=0.01, sub_optimal_distribution=("beta",(14.9, 149.0)), optimal_distribution=("beta",(0.01, 0.11)), other_distribution=("beta",(14.9, 1499.0)))

  1. SimpleGrid(reward_type=3, size=10, make_reward_stochastic=True, p_lazy=0.2, p_rand=0.01)

  2. SimpleGrid(reward_type=3, size=16, make_reward_stochastic=True, p_lazy=0.2, p_rand=0.2)

  3. SimpleGrid(reward_type=3, size=16, make_reward_stochastic=False, p_lazy=0.01, p_rand=0.2)

  1. Taxi(make_reward_stochastic=True, p_lazy=0.001, size=4, length=1, width=1, space=1, n_locations=3, p_rand=0.01, default_r=("beta",(0.8, 20.0)), successfully_delivery_r=("beta",(1.0, 0.1)), failure_delivery_r=("beta",(0.8, 50.0)))

  2. Taxi(make_reward_stochastic=True, p_lazy=0.01, size=4, length=1, width=1, space=1, n_locations=3, p_rand=0.2, default_r=("beta",(0.8, 6.0)), successfully_delivery_r=("beta",(1.0, 0.1)), failure_delivery_r=("beta",(0.8, 40.0)))

Episodic communicating

  1. DeepSea(size=5, p_rand=None, make_reward_stochastic=True)

  2. DeepSea(size=25, p_rand=None, make_reward_stochastic=True)

  1. FrozenLake(size=3, make_reward_stochastic=True, p_lazy=None, p_frozen=0.9, p_rand=None)

  1. MiniGridEmpty(size=6, make_reward_stochastic=True, p_lazy=None, p_rand=None, n_starting_states=3)

  2. MiniGridEmpty(size=6, make_reward_stochastic=True, p_lazy=0.35, p_rand=None, n_starting_states=5)

  3. MiniGridEmpty(size=6, make_reward_stochastic=True, p_lazy=0.25, p_rand=0.1, n_starting_states=3)

  4. MiniGridEmpty(size=10, make_reward_stochastic=False, p_lazy=0.1, p_rand=None, n_starting_states=5)

  5. MiniGridEmpty(size=10, make_reward_stochastic=False, p_lazy=0.15, p_rand=0.1, n_starting_states=3)

  1. MiniGridRooms(make_reward_stochastic=True, room_size=4, n_rooms=4, p_rand=None, p_lazy=None)

  2. MiniGridRooms(make_reward_stochastic=False, room_size=3, n_rooms=9, p_rand=None, p_lazy=0.05)

  3. MiniGridRooms(make_reward_stochastic=False, room_size=3, n_rooms=9, p_rand=None, p_lazy=0.1)

  4. MiniGridRooms(make_reward_stochastic=False, room_size=3, n_rooms=4, p_rand=None, p_lazy=0.15)

  1. RiverSwim(make_reward_stochastic=True, size=25, p_lazy=0.05, p_rand=None, sub_optimal_distribution=("beta",(12.4, 124.0)), optimal_distribution=("beta",(0.01, 0.11)), other_distribution=("beta",(12.4, 1249.0)))

  2. RiverSwim(make_reward_stochastic=False, size=40, p_lazy=None, p_rand=None, sub_optimal_distribution=None, optimal_distribution=None, other_distribution=None)

  1. SimpleGrid(size=10, make_reward_stochastic=True, p_lazy=0.5)

  2. SimpleGrid(size=13, make_reward_stochastic=True, p_lazy=0.4)

  3. SimpleGrid(size=13, make_reward_stochastic=False, p_lazy=None)

  4. SimpleGrid(size=20, make_reward_stochastic=True, p_lazy=0.4)

  1. Taxi(make_reward_stochastic=True, p_lazy=0.01, size=4, length=1, width=1, space=1, n_locations=3, p_rand=None, default_r=("beta",(0.7, 30.0)), successfully_delivery_r=("beta",(0.4, 0.1)), failure_delivery_r=("beta",(0.8, 50.0)))

  2. Taxi(make_reward_stochastic=True, p_lazy=0.2, size=5, length=1, width=1, space=1, n_locations=3, p_rand=None, default_r=("beta",(0.7, 30.0)), successfully_delivery_r=("beta",(0.4, 0.1)), failure_delivery_r=("beta",(0.8, 50.0)))

Continuous ergodic

  1. DeepSea(size=20, p_rand=0.1, make_reward_stochastic=False)

  1. FrozenLake(size=5, make_reward_stochastic=True, p_lazy=0.01, p_frozen=0.95, p_rand=0.05)

  1. MiniGridEmpty(size=12, make_reward_stochastic=True, p_lazy=0.05, p_rand=0.495, n_starting_states=3)

  2. MiniGridEmpty(size=12, make_reward_stochastic=True, p_lazy=0.1, p_rand=0.395, n_starting_states=3)

  3. MiniGridEmpty(size=10, make_reward_stochastic=False, p_lazy=0.02, p_rand=0.7, n_starting_states=3)

  4. MiniGridEmpty(size=14, make_reward_stochastic=True, p_lazy=0.02, p_rand=0.6, n_starting_states=3)

  5. MiniGridEmpty(size=10, make_reward_stochastic=True, p_lazy=0.1, p_rand=0.21, n_starting_states=3, optimal_distribution=("beta",(1.0, 0.11)), other_distribution=("beta",(1.0, 4.0)))

  6. MiniGridEmpty(size=14, make_reward_stochastic=True, p_lazy=0.02, p_rand=0.4, n_starting_states=3, optimal_distribution=("beta",(0.5, 0.11)), other_distribution=("beta",(1.5, 4.0)))

  7. MiniGridEmpty(size=14, make_reward_stochastic=True, p_lazy=0.05, p_rand=0.31, n_starting_states=3, optimal_distribution=("beta",(1.0, 0.11)), other_distribution=("beta",(1.0, 4.0)))

  8. MiniGridEmpty(size=14, make_reward_stochastic=True, p_lazy=0.1, p_rand=0.6, n_starting_states=3, optimal_distribution=("beta",(0.3, 0.11)), other_distribution=("beta",(2.0, 4.0)))

  1. MiniGridRooms(make_reward_stochastic=True, room_size=3, n_rooms=16, p_rand=0.1, p_lazy=0.4)

  2. MiniGridRooms(make_reward_stochastic=False, room_size=5, n_rooms=9, p_rand=0.3, p_lazy=0.4)

  1. RiverSwim(make_reward_stochastic=False, size=30, p_lazy=0.1, p_rand=0.2)

  2. RiverSwim(make_reward_stochastic=True, size=50, p_lazy=0.1, p_rand=0.1)

  3. RiverSwim(make_reward_stochastic=True, size=80, p_lazy=0.001, p_rand=0.2)

  4. RiverSwim(make_reward_stochastic=False, size=80, p_lazy=0.1, p_rand=0.01)

  1. SimpleGrid(size=15, make_reward_stochastic=True, p_lazy=0.4, p_rand=0.4)

  2. SimpleGrid(size=10, make_reward_stochastic=False, p_lazy=0.2, sub_optimal_distribution=None, optimal_distribution=None, other_distribution=None, p_rand=0.01)

  3. SimpleGrid(size=20, make_reward_stochastic=False, p_lazy=0.1, sub_optimal_distribution=None, optimal_distribution=None, other_distribution=None, p_rand=0.1)

  1. Taxi(make_reward_stochastic=True, p_lazy=0.01, size=4, length=1, width=1, space=1, n_locations=3, p_rand=0.1, default_r=("beta",(0.8, 20.0)), successfully_delivery_r=("beta",(1.0, 0.1)), failure_delivery_r=("beta",(0.8, 50.0)))

Continuous communicating

  1. DeepSea(size=40, p_rand=None, make_reward_stochastic=True)

  2. DeepSea(size=40, p_rand=None, make_reward_stochastic=False)

  3. DeepSea(size=35, p_rand=None, make_reward_stochastic=True)

  1. FrozenLake(size=4, make_reward_stochastic=True, p_lazy=0.01, p_frozen=0.95, p_rand=None)

  2. FrozenLake(size=5, make_reward_stochastic=True, p_lazy=0.35, p_frozen=0.9, p_rand=None)

  1. MiniGridEmpty(size=12, make_reward_stochastic=True, p_lazy=0.25, p_rand=None, n_starting_states=3, optimal_distribution=("beta",(1.0, 0.11)), other_distribution=("beta",(1.0, 4.0)))

  2. MiniGridEmpty(size=12, make_reward_stochastic=False, p_lazy=0.3, p_rand=None, n_starting_states=3, optimal_distribution=None, other_distribution=None)

  3. MiniGridEmpty(size=8, make_reward_stochastic=False, p_lazy=0.3, p_rand=None, n_starting_states=3, optimal_distribution=None, other_distribution=None)

  4. MiniGridEmpty(size=8, make_reward_stochastic=True, p_lazy=0.7, p_rand=None, n_starting_states=3, optimal_distribution=("beta",(1.0, 0.11)), other_distribution=("beta",(1.0, 4.0)))

  5. MiniGridEmpty(size=12, make_reward_stochastic=True, p_lazy=0.7, p_rand=None, n_starting_states=3, optimal_distribution=("beta",(1.0, 0.11)), other_distribution=("beta",(1.0, 4.0)))

  1. MiniGridRooms(make_reward_stochastic=True, room_size=5, n_rooms=9, p_rand=None, p_lazy=0.3)

  2. MiniGridRooms(make_reward_stochastic=True, room_size=3, n_rooms=9, p_rand=None, p_lazy=0.5)

  3. MiniGridRooms(make_reward_stochastic=True, room_size=5, n_rooms=9, p_rand=None, p_lazy=0.5)

  1. RiverSwim(make_reward_stochastic=True, size=25, p_lazy=0.1, p_rand=None)

  2. RiverSwim(make_reward_stochastic=True, size=90, p_lazy=0.03, p_rand=None)

  1. SimpleGrid(size=15, make_reward_stochastic=True, p_lazy=0.2, p_rand=None, sub_optimal_distribution=("beta",(0.3, 49.0)), optimal_distribution=("beta",(2.0, 0.11)), other_distribution=("beta",(0.3, 4.0)))

  2. SimpleGrid(size=16, make_reward_stochastic=False, p_lazy=0.065, p_rand=None, sub_optimal_distribution=None, optimal_distribution=None, other_distribution=None)

  3. SimpleGrid(size=25, make_reward_stochastic=True, p_lazy=0.1, p_rand=None, sub_optimal_distribution=("beta",(0.3, 49.0)), optimal_distribution=("beta",(2.0, 0.11)), other_distribution=("beta",(0.3, 4.0)))

  4. SimpleGrid(size=25, make_reward_stochastic=False, p_lazy=0.3, p_rand=None, sub_optimal_distribution=None, optimal_distribution=None, other_distribution=None)

  1. Taxi(make_reward_stochastic=True, p_lazy=0.01, size=4, length=1, width=1, space=1, n_locations=3, p_rand=None, default_r=("beta",(0.7, 30.0)), successfully_delivery_r=("beta",(0.4, 0.1)), failure_delivery_r=("beta",(0.8, 50.0)))

Instantiate the default benchmark

A benchmark in \(\texttt{Colosseum}\) can be instantiated using the ColosseumBenchmark class. A ColosseumBenchmark object contains the parameters of the MDPs and an ExperimentConfig, which regulates the agent/MDP interactions. The default benchmark can be accessed through the ColosseumDefaultBenchmark enumeration, which also allows retrieving the benchmark objects, as shown in the snippets below.
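For instance, a default benchmark object and its configuration can be inspected directly. The snippet below is a minimal sketch: the import path and the experiment_config attribute name are assumptions about the \(\texttt{Colosseum}\) API.

from colosseum.benchmark.benchmark import ColosseumDefaultBenchmark  # import path assumed

# Retrieve the episodic ergodic default benchmark as a ColosseumBenchmark object
benchmark = ColosseumDefaultBenchmark.EPISODIC_ERGODIC.get_benchmark()

# Inspect the ExperimentConfig that regulates the agent/MDP interactions
# (attribute name assumed)
print(benchmark.experiment_config)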

from glob import glob

import seedir

# Colosseum imports; the exact module paths are assumed and may differ across versions
from colosseum.benchmark.benchmark import ColosseumDefaultBenchmark
from colosseum.benchmark.utils import instantiate_benchmark_folder, print_benchmark_configurations

# Locally instantiate the episodic ergodic benchmark with folder name "benchmark_er"
instantiate_benchmark_folder(
    ColosseumDefaultBenchmark.EPISODIC_ERGODIC.get_benchmark(), "benchmark_er"
)

# Locally instantiate the episodic communicating benchmark with folder name "benchmark_ec"
instantiate_benchmark_folder(
    ColosseumDefaultBenchmark.EPISODIC_COMMUNICATING.get_benchmark(), "benchmark_ec"
)

# Locally instantiate the continuous ergodic benchmark with folder name "benchmark_ce"
instantiate_benchmark_folder(
    ColosseumDefaultBenchmark.CONTINUOUS_ERGODIC.get_benchmark(), "benchmark_ce"
)

# Locally instantiate the continuous communicating benchmark with folder name "benchmark_cc"
instantiate_benchmark_folder(
    ColosseumDefaultBenchmark.CONTINUOUS_COMMUNICATING.get_benchmark(), "benchmark_cc"
)

# Print the folder structure of the benchmark
for bdir in glob("benchmark_*"):
    seedir.seedir(bdir, style="emoji")
print("-" * 70)
# Print the benchmark configurations
print_benchmark_configurations()
📁 benchmark_er/
├─📁 mdp_configs/
│ ├─📄 MiniGridRoomsEpisodic.gin
│ ├─📄 DeepSeaEpisodic.gin
│ ├─📄 SimpleGridEpisodic.gin
│ ├─📄 MiniGridEmptyEpisodic.gin
│ ├─📄 TaxiEpisodic.gin
│ ├─📄 RiverSwimEpisodic.gin
│ └─📄 FrozenLakeEpisodic.gin
└─📄 experiment_config.yml
📁 benchmark_ec/
├─📁 mdp_configs/
│ ├─📄 MiniGridRoomsEpisodic.gin
│ ├─📄 DeepSeaEpisodic.gin
│ ├─📄 SimpleGridEpisodic.gin
│ ├─📄 MiniGridEmptyEpisodic.gin
│ ├─📄 TaxiEpisodic.gin
│ ├─📄 RiverSwimEpisodic.gin
│ └─📄 FrozenLakeEpisodic.gin
└─📄 experiment_config.yml
📁 benchmark_ce/
├─📁 mdp_configs/
│ ├─📄 MiniGridEmptyContinuous.gin
│ ├─📄 TaxiContinuous.gin
│ ├─📄 DeepSeaContinuous.gin
│ ├─📄 MiniGridRoomsContinuous.gin
│ ├─📄 RiverSwimContinuous.gin
│ ├─📄 SimpleGridContinuous.gin
│ └─📄 FrozenLakeContinuous.gin
└─📄 experiment_config.yml
📁 benchmark_cc/
├─📁 mdp_configs/
│ ├─📄 MiniGridEmptyContinuous.gin
│ ├─📄 TaxiContinuous.gin
│ ├─📄 DeepSeaContinuous.gin
│ ├─📄 MiniGridRoomsContinuous.gin
│ ├─📄 RiverSwimContinuous.gin
│ ├─📄 SimpleGridContinuous.gin
│ └─📄 FrozenLakeContinuous.gin
└─📄 experiment_config.yml
----------------------------------------------------------------------
Default experiment configuration in the 'experiment_config.yml' files.
	n_seeds: 20
	n_steps: 500000
	max_interaction_time_s: 600
	log_performance_indicators_every: 100

The mdp_configs folders for the different settings contain the Gin configuration files of the MDPs presented above, and the experiment_config.yml file contains the default ExperimentConfig.
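The same configuration can also be built programmatically. The snippet below is a minimal sketch: the import path is assumed, while the field names are taken from the experiment_config.yml keys shown above.

from colosseum.experiment import ExperimentConfig  # import path assumed

# Reconstruct the default configuration: 20 seeds, 500 000 interaction steps,
# a 600-second cap on the interaction time, and performance indicators logged
# every 100 steps
experiment_config = ExperimentConfig(
    n_seeds=20,
    n_steps=500_000,
    max_interaction_time_s=600,
    log_performance_indicators_every=100,
)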