Inspect Agent Performance

\(\texttt{Colosseum}\) implements several performance indicators. After describing them, we list the agent classes and the environment classes available in the package. We then compare the performance of two agents in the same environment, and the performance of a single agent class across different MDPs. When comparing across environments, the normalized versions of the indicators should be used to account for differences in the transition and reward kernels.

Indicators

Reward-based indicators
The \(\texttt{cumulative_reward}\) indicator measures the empirical cumulative reward the agent obtains while interacting with the MDP. The \(\texttt{cumulative_expected_reward}\) indicator computes the expected average reward of the agent's best policy at fixed intervals and accumulates these values.
Regret-based indicators
The \(\texttt{cumulative_regret}\) indicator computes the expected regret of the agent's best policy at fixed intervals and accumulates these values.
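As an illustrative sketch (the exact definitions used by \(\texttt{Colosseum}\) may differ in detail, for instance in how the logging times are chosen), let \(t_1 < t_2 < \dots\) be the logging times, \(\hat\pi_{t_i}\) the agent's best-policy estimate at time \(t_i\), \(\bar r(\pi)\) the expected average reward of a policy \(\pi\), and \(\pi^*\) the optimal policy. Then

\[
\text{cumulative expected reward at } t_k \;\approx\; \sum_{i=1}^{k} (t_i - t_{i-1})\, \bar r\big(\hat\pi_{t_i}\big),
\qquad
\text{cumulative regret at } t_k \;\approx\; \sum_{i=1}^{k} (t_i - t_{i-1}) \Big[\bar r(\pi^*) - \bar r\big(\hat\pi_{t_i}\big)\Big].
\]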
Baseline indicators
The \(\texttt{cumulative_expected_reward}\) and \(\texttt{cumulative_regret}\) indicators are also available for three baseline policies: the uniformly random policy, the optimal policy, and the worst-performing policy.
Computation cost
The \(\texttt{steps_per_second}\) indicator measures the average number of time steps the agent completes per second.

The full list of indicators can be obtained by calling the \(\texttt{get_available_indicators}\) function.
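As a minimal sketch, the snippet below prints the available indicator names; the import path of \(\texttt{get_available_indicators}\) is an assumption here and may differ across \(\texttt{Colosseum}\) versions.

# NOTE: the import path below is an assumption; check where
# get_available_indicators is defined in your installed version of Colosseum.
from colosseum.experiment import get_available_indicators

# Print every indicator name that can be passed to MDPLoop.plot(...).
for indicator in get_available_indicators():
    print(indicator)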

Available agents

Available MDP classes

Fixed MDP comparison

We compare QLearningEpisodic with PSRLEpisodic in RiverSwimEpisodic for a total of \(20\ 000\) time steps. Since the MDP is fixed, we can compare performance directly, without necessarily using the normalized versions of the indicators.

seed = 42
optimization_horizon = 20_000
log_every = 200

# Instantiate the MDP
mdp = RiverSwimEpisodic(seed=0, size=4, p_rand=0.01)

# Instantiate the agents
q_learning_agent = QLearningEpisodic(
    mdp_specs=make_mdp_spec(mdp),
    seed=seed,
    optimization_horizon=optimization_horizon,
    c_1=0.95,
    p=0.05,
)
psrl_agent = PSRLEpisodic(
    mdp_specs=make_mdp_spec(mdp),
    seed=seed,
    optimization_horizon=optimization_horizon,
)

# Q-learning interaction
loop_ql = MDPLoop(mdp, q_learning_agent)
loop_ql.run(T=optimization_horizon, log_every=log_every)

# PSRL interaction
loop_psrl = MDPLoop(mdp, psrl_agent)
loop_psrl.run(T=optimization_horizon, log_every=log_every)

Cumulative reward

# Create the shared axis for the plot
fig, ax = plt.subplots(1, 1, figsize=(12, 4))

# Plot the cumulative reward of the q-learning agent without the baselines
loop_ql.plot("cumulative_reward", ax, baselines=[])

# Plot the cumulative reward of the PSRL agent, including the baselines this time
loop_psrl.plot("cumulative_reward", ax)

# Display the figure
glue("cumulative_reward", fig, display=False)
(Figure: cumulative reward of the two agents, with baselines shown for the PSRL plot.)

Cumulative expected reward

fig1, ax = plt.subplots(1, 1, figsize=(12, 4))
loop_ql.plot("cumulative_expected_reward", ax, baselines=[])
loop_psrl.plot("cumulative_expected_reward", ax)
glue("cumulative_expected_reward", fig1, display=False)
(Figure: cumulative expected reward of the two agents.)

Cumulative regret

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
loop_ql.plot("cumulative_regret", ax, baselines=[])
loop_psrl.plot("cumulative_regret", ax)
glue("cumulative_regret", fig, display=False)
plt.show()
(Figure: cumulative regret of the two agents.)

Steps per second

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
loop_psrl.plot("steps_per_second", ax)
loop_ql.plot("steps_per_second", ax)
glue("steps_per_second", fig, display=False)
plt.show()
(Figure: steps per second of the two agents.)

Fixed agent class comparison

We compare QLearningEpisodic in two different instances of RiverSwimEpisodic for a total of \(20\ 000\) time steps. In this case, the normalized versions of the indicators should be used, since the two MDPs differ in size and therefore in the scale of the achievable rewards.
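As a rough sketch of the idea behind normalization (not necessarily the exact definition used by \(\texttt{Colosseum}\)), the regret can be rescaled by the gap between the optimal and the worst-performing policies so that its magnitude becomes comparable across MDPs:

\[
\text{normalized cumulative regret at } t \;\approx\; \frac{\text{cumulative regret at } t}{t \left[\bar r(\pi^*) - \bar r(\pi^{\mathrm{worst}})\right]},
\]

with \(\bar r(\pi^*)\) and \(\bar r(\pi^{\mathrm{worst}})\) the expected average rewards of the optimal and worst-performing policies, in the notation introduced above.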

seed = 42
optimization_horizon = 20_000

# Instantiate the MDPs
mdp_short = RiverSwimEpisodic(seed=0, size=4, p_rand=0.01)
mdp_long = RiverSwimEpisodic(seed=0, size=10, p_rand=0.01)

# Instantiate the agent
agent_for_short = QLearningEpisodic(
    mdp_specs=make_mdp_spec(mdp_short),
    seed=seed,
    optimization_horizon=optimization_horizon,
    c_1=0.95,
    p=0.05,
)
agent_for_long = QLearningEpisodic(
    mdp_specs=make_mdp_spec(mdp_long),
    seed=seed,
    optimization_horizon=optimization_horizon,
    c_1=0.95,
    p=0.05,
)

# Short Riverswim interaction
loop_short = MDPLoop(mdp_short, agent_for_short)
loop_short.run(T=optimization_horizon, log_every=200)

# Long Riverswim interaction
loop_long = MDPLoop(mdp_long, agent_for_long)
loop_long.run(T=optimization_horizon, log_every=200)

Normalized cumulative reward

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
loop_long.plot("normalized_cumulative_reward", ax, label="q-learning (1)", baselines=[])
loop_short.plot("normalized_cumulative_reward", ax, label="q-learning (2)")
glue("normalized_cumulative_reward", fig, display=False)
plt.close()
(Figure: normalized cumulative reward of QLearningEpisodic in the two RiverSwim instances.)

Normalized cumulative expected reward

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
loop_long.plot("normalized_cumulative_expected_reward", ax, label="q-learning (1)")
loop_short.plot("normalized_cumulative_expected_reward", ax, label="q-learning (2)", baselines=[])
glue("normalized_cumulative_expected_reward", fig, display=False)
plt.close()
(Figure: normalized cumulative expected reward of QLearningEpisodic in the two RiverSwim instances.)

Normalized cumulative regret

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
loop_long.plot("normalized_cumulative_regret", ax, label="q-learning (1)")
loop_short.plot("normalized_cumulative_regret", ax, label="q-learning (2)", baselines=[])
glue("normalized_cumulative_regret", fig, display=False)
plt.close()
(Figure: normalized cumulative regret of QLearningEpisodic in the two RiverSwim instances.)

Steps per second

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
loop_long.plot("steps_per_second", ax, label="q-learning (1)")
loop_short.plot("steps_per_second", ax, label="q-learning (2)")
glue("normalized_steps_per_second", fig, display=False)
plt.close()
(Figure: steps per second of QLearningEpisodic in the two RiverSwim instances.)