# Agents¶

## CrossEntropyAgent¶

class numpy_ml.rl_models.agents.CrossEntropyAgent(env, n_samples_per_episode=500, retain_prcnt=0.2)[source]

A cross-entropy method agent.

Notes

The cross-entropy method   agent only operates on envs with discrete action spaces.

On each episode the agent generates n_theta_samples of the parameters ($$\theta$$) for its behavior policy. The i’th sample at timestep t is:

$\begin{split}\theta_i &= \{\mathbf{W}_i^{(t)}, \mathbf{b}_i^{(t)} \} \\ \theta_i &\sim \mathcal{N}(\mu^{(t)}, \Sigma^{(t)})\end{split}$

Weights ($$\mathbf{W}_i$$) and bias ($$\mathbf{b}_i$$) are the parameters of the softmax policy:

$\begin{split}\mathbf{z}_i &= \text{obs} \cdot \mathbf{W}_i + \mathbf{b}_i \\ p(a_i^{(t + 1)}) &= \frac{e^{\mathbf{z}_i}}{\sum_j e^{z_{ij}}} \\ a^{(t + 1)} &= \arg \max_j p(a_j^{(t+1)})\end{split}$

At the end of each episode, the agent takes the top retain_prcnt highest scoring $$\theta$$ samples and combines them to generate the mean and variance of the distribution of $$\theta$$ for the next episode:

$\begin{split}\mu^{(t+1)} &= \text{avg}(\texttt{best_thetas}^{(t)}) \\ \Sigma^{(t+1)} &= \text{var}(\texttt{best_thetas}^{(t)})\end{split}$

References

  Mannor, S., Rubinstein, R., & Gat, Y. (2003). The cross entropy method for fast policy search. In Proceedings of the 20th Annual ICML, 20.
  Rubinstein, R. (1997). optimization of computer simulation models with rare events, European Journal of Operational Research, 99, 89–112.
Parameters: env (gym.wrappers() or gym.envs() instance) – The environment to run the agent on. n_samples_per_episode (int) – The number of theta samples to evaluate on each episode. Default is 500. retain_prcnt (float) – The percentage of n_samples_per_episode to use when calculating the parameter update at the end of the episode. Default is 0.2.
act(obs)[source]

Generate actions according to a softmax policy.

Notes

The softmax policy assumes that the pmf over actions in state $$x_t$$ is given by:

$\pi(a | x^{(t)}) = \text{softmax}( \text{obs}^{(t)} \cdot \mathbf{W}_i^{(t)} + \mathbf{b}_i^{(t)} )$

where $$\mathbf{W}$$ is a learned weight matrix, obs is the observation at timestep t, and b is a learned bias vector.

Parameters: obs (int or ndarray) – An observation from the environment. action (int, float, or ndarray) – An action sampled from the distribution over actions defined by the softmax policy.
run_episode(max_steps, render=False)[source]

Run the agent on a single episode.

Parameters: max_steps (int) – The maximum number of steps to run an episode render (bool) – Whether to render the episode during training reward (float) – The total reward on the episode, averaged over the theta samples. steps (float) – The total number of steps taken on the episode, averaged over the theta samples.
update()[source]

Update $$\mu$$ and $$\Sigma$$ according to the rewards accrued on the current episode.

Returns: avg_reward (float) – The average reward earned by the best retain_prcnt theta samples on the current episode.
greedy_policy(max_steps, render=True)[source]

Execute a greedy policy using the current agent parameters.

Parameters: max_steps (int) – The maximum number of steps to run the episode. render (bool) – Whether to render the episode during execution. total_reward (float) – The total reward on the episode. n_steps (float) – The total number of steps taken on the episode.
flush_history()[source]

Clear the episode history

## DynaAgent¶

class numpy_ml.rl_models.agents.DynaAgent(env, lr=0.4, epsilon=0.1, n_tilings=8, obs_max=None, obs_min=None, q_plus=False, grid_dims=[8, 8], explore_weight=0.05, temporal_discount=0.9, n_simulated_actions=50)[source]

A Dyna-Q / Dyna-Q+ agent  with full TD(0) Q-learning updates via prioritized-sweeping  .

Notes

This approach consists of three components: a planning method involving simulated actions, a direct RL method where the agent directly interacts with the environment, and a model-learning method where the agent learns to better represent the environment during planning.

During planning, the agent performs random-sample one-step tabular Q-planning with prioritized sweeping. This entails using a priority queue to retrieve the state-action pairs from the agent’s history which would stand to have the largest change to their Q-values if backed up. Specifically, for state action pair (s, a) the priority value is:

$P = \sum_{s'} p(s') | r + \gamma \max_a \{Q(s', a) \} - Q(s, a) |$

which corresponds to the absolute magnitude of the TD(0) Q-learning backup for the pair.

When the first pair in the queue is backed up, the effect on each of its predecessor pairs is computed. If the predecessor’s priority is greater than a small threshold the pair is added to the queue and the process is repeated until either the queue is empty or we have exceeded n_simulated_actions updates. These backups occur without the agent taking any action in the environment and thus constitute simulations based on the agent’s current model of the environment (i.e., its tabular state-action history).

During the direct RL phase, the agent takes an action based on its current behavior policy and Q function and receives a reward from the environment. The agent logs this state-action-reward-new state tuple in its interaction table (i.e., environment model) and updates its Q function using a full-backup version of the Q-learning update:

$Q(s, a) \leftarrow Q(s, a) + \eta \sum_{r, s'} p(r, s' \mid s, a) \left(r + \gamma \max_a \{ Q(s', a) \} - Q(s, a) \right)$

References

  Sutton, R. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th Annual ICML, 216-224.
  Moore, A. & Atkeson, C. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1), 103-130.
Parameters: env (gym.wrappers or gym.envs instance) – The environment to run the agent on lr (float) – Learning rate for the Q function updates. Default is 0.05. epsilon (float between [0, 1]) – The epsilon value in the epsilon-soft policy. Larger values encourage greater exploration during training. Default is 0.1. n_tilings (int) – The number of overlapping tilings to use if the env observation space is continuous. Unused if observation space is discrete. Default is 8. obs_max (float or ndarray or None) – The value to treat as the max value of the observation space when calculating the grid widths if the observation space is continuous. If None, use env.observation_space.high(). Unused if observation space is discrete. Default is None. obs_min (float or ndarray or None) – The value to treat as the min value of the observation space when calculating grid widths if the observation space is continuous. If None, use env.observation_space.low(). Unused if observation space is discrete. Default is None. grid_dims (list) – The number of rows and columns in each tiling grid if the env observation space is continuous. Unused if observation space is discrete. Default is [8, 8]. q_plus (bool) – Whether to add incentives for visiting states that the agent hasn’t encountered recently. Default is False. explore_weight (float) – Amount to incentivize exploring states that the agent hasn’t recently visited. Only used if q_plus is True. Default is 0.05. temporal_discount (float between [0, 1]) – The discount factor used for downweighting future rewards. Smaller values result in greater discounting of future rewards. Default is 0.9. n_simulated_actions (int) – THe number of simulated actions to perform for each “real” action. Default is 50.
act(obs)[source]

Execute the behavior policy–an $$\epsilon$$-soft policy used to generate actions during training.

Parameters: obs (int, float, or ndarray as returned by env.step(action)) – An observation from the environment. action (int, float, or ndarray) – An action sampled from the distribution over actions defined by the epsilon-soft policy.
update()[source]

Update the priority queue with the most recent (state, action) pair and perform random-sample one-step tabular Q-planning.

Notes

The planning algorithm uses a priority queue to retrieve the state-action pairs from the agent’s history which will result in the largest change to its Q-value if backed up. When the first pair in the queue is backed up, the effect on each of its predecessor pairs is computed. If the predecessor’s priority is greater than a small threshold the pair is added to the queue and the process is repeated until either the queue is empty or we exceed n_simulated_actions updates.

run_episode(max_steps, render=False)[source]

Run the agent on a single episode without performing Q-function backups.

Parameters: max_steps (int) – The maximum number of steps to run an episode. render (bool) – Whether to render the episode during training. reward (float) – The total reward on the episode. steps (float) – The number of steps taken on the episode.
train_episode(max_steps, render=False)[source]

Train the agent on a single episode.

Parameters: max_steps (int) – The maximum number of steps to run an episode. render (bool) – Whether to render the episode during training. reward (float) – The total reward on the episode. steps (float) – The number of steps taken on the episode.
greedy_policy(max_steps, render=True)[source]

Execute a deterministic greedy policy using the current agent parameters.

Parameters: max_steps (int) – The maximum number of steps to run the episode. render (bool) – Whether to render the episode during execution. total_reward (float) – The total reward on the episode. n_steps (float) – The total number of steps taken on the episode.
flush_history()[source]

Clear the episode history

## MonteCarloAgent¶

Monte Carlo methods are ways of solving RL problems based on averaging sample returns for each state-action pair. Parameters are updated only at the completion of an episode.

In on-policy learning, the agent maintains a single policy that it updates over the course of training. In order to ensure the policy converges to a (near-) optimal policy, the agent must maintain that the policy assigns non-zero probability to ALL state-action pairs during training to ensure continual exploration.

• Thus on-policy learning is a compromise–it learns action values not for the optimal policy, but for a near-optimal policy that still explores.

In off-policy learning, the agent maintains two separate policies:

1. Target policy: The policy that is learned during training and that will eventually become the optimal policy.
2. Behavior policy: A policy that is more exploratory and is used to generate behavior during training.

Off-policy methods are often of greater variance and are slower to converge. On the other hand, off-policy methods are more powerful and general than on-policy methods.

class numpy_ml.rl_models.agents.MonteCarloAgent(env, off_policy=False, temporal_discount=0.9, epsilon=0.1)[source]

A Monte-Carlo learning agent trained using either first-visit Monte Carlo updates (on-policy) or incremental weighted importance sampling (off-policy).

Parameters: env (gym.wrappers or gym.envs instance) – The environment to run the agent on. off_policy (bool) – Whether to use a behavior policy separate from the target policy during training. If False, use the same epsilon-soft policy for both behavior and target policies. Default is False. temporal_discount (float between [0, 1]) – The discount factor used for downweighting future rewards. Smaller values result in greater discounting of future rewards. Default is 0.9. epsilon (float between [0, 1]) – The epsilon value in the epsilon-soft policy. Larger values encourage greater exploration during training. Default is 0.1.
act(obs)[source]

Execute the behavior policy–an $$\epsilon$$-soft policy used to generate actions during training.

Parameters: obs (int, float, or ndarray as returned by env.step(action)) – An observation from the environment. action (int, float, or ndarray) – An action sampled from the distribution over actions defined by the epsilon-soft policy.
run_episode(max_steps, render=False)[source]

Run the agent on a single episode.

Parameters: max_steps (int) – The maximum number of steps to run an episode. render (bool) – Whether to render the episode during training. reward (float) – The total reward on the episode. steps (float) – The number of steps taken on the episode.
update()[source]

Update the parameters of the model following the completion of an episode. Flush the episode history after the update is complete.

greedy_policy(max_steps, render=True)[source]

Execute a greedy policy using the current agent parameters.

Parameters: max_steps (int) – The maximum number of steps to run the episode. render (bool) – Whether to render the episode during execution. total_reward (float) – The total reward on the episode. n_steps (float) – The total number of steps taken on the episode.
flush_history()[source]

Clear the episode history

## TemporalDifferenceAgent¶

Temporal difference methods are examples of bootstrapping in that they update their estimate for the value of state s on the basis of a previous estimate.

1. They do not require a model of the environment, its reward, or its next-state probability distributions.
2. They are implemented in an online, fully incremental fashion. This allows them to be used with infinite-horizons / when episodes take prohibitively long to finish.
3. TD algorithms learn from each transition regardless of what subsequent actions are taken.
4. In practice, TD methods have usually been found to converge faster than constant-$$\alpha$$ Monte Carlo methods on stochastic tasks.
class numpy_ml.rl_models.agents.TemporalDifferenceAgent(env, lr=0.4, epsilon=0.1, n_tilings=8, obs_max=None, obs_min=None, grid_dims=[8, 8], off_policy=False, temporal_discount=0.99)[source]

A temporal difference learning agent with expected SARSA (on-policy)  or TD(0) Q-learning (off-policy)  updates.

Notes

The expected SARSA on-policy TD(0) update is:

$Q(s, a) \leftarrow Q(s, a) + \eta \left( r + \gamma \mathbb{E}_\pi[Q(s', a') \mid s'] - Q(s, a) \right)$

and the TD(0) off-policy Q-learning upate is:

$Q(s, a) \leftarrow Q(s, a) + \eta ( r + \gamma \max_a \left\{ Q(s', a) \right\} - Q(s, a) )$

where in each case we have taken action a in state s, received reward r, and transitioned into state $$s'$$. In the above equations, $$\eta$$ is a learning rate parameter, $$\gamma$$ is a temporal discount factor, and $$\mathbb{E}_\pi[ Q[s', a'] \mid s']$$ is the expected value under the current policy $$\pi$$ of the Q function conditioned that we are in state $$s'$$.

Observe that the expected SARSA update can be used for both on- and off-policy methods. In an off-policy context, if the target policy is greedy and the expectation is taken wrt. the target policy then the expected SARSA update is exactly Q-learning.

NB. For this implementation the agent requires a discrete action space, but will try to discretize the observation space via tiling if it is continuous.

References

  Rummery, G. & Niranjan, M. (1994). On-Line Q-learning Using Connectionist Systems. Tech Report 166. Cambridge University Department of Engineering.
  Watkins, C. (1989). Learning from delayed rewards. PhD thesis, Cambridge University.
Parameters: env (gym.wrappers or gym.envs instance) – The environment to run the agent on. lr (float) – Learning rate for the Q function updates. Default is 0.05. epsilon (float between [0, 1]) – The epsilon value in the epsilon-soft policy. Larger values encourage greater exploration during training. Default is 0.1. n_tilings (int) – The number of overlapping tilings to use if the env observation space is continuous. Unused if observation space is discrete. Default is 8. obs_max (float or ndarray) – The value to treat as the max value of the observation space when calculating the grid widths if the observation space is continuous. If None, use env.observation_space.high. Unused if observation space is discrete. Default is None. obs_min (float or ndarray) – The value to treat as the min value of the observation space when calculating grid widths if the observation space is continuous. If None, use env.observation_space.low. Unused if observation space is discrete. Default is None. grid_dims (list) – The number of rows and columns in each tiling grid if the env observation space is continuous. Unused if observation space is discrete. Default is [8, 8]. off_policy (bool) – Whether to use a behavior policy separate from the target policy during training. If False, use the same epsilon-soft policy for both behavior and target policies. Default is False. temporal_discount (float between [0, 1]) – The discount factor used for downweighting future rewards. Smaller values result in greater discounting of future rewards. Default is 0.9.
run_episode(max_steps, render=False)[source]

Run the agent on a single episode without updating the priority queue or performing backups.

Parameters: max_steps (int) – The maximum number of steps to run an episode render (bool) – Whether to render the episode during training reward (float) – The total reward on the episode, averaged over the theta samples. steps (float) – The total number of steps taken on the episode, averaged over the theta samples.
train_episode(max_steps, render=False)[source]

Train the agent on a single episode.

Parameters: max_steps (int) – The maximum number of steps to run an episode. render (bool) – Whether to render the episode during training. reward (float) – The total reward on the episode. steps (float) – The number of steps taken on the episode.
update()[source]

Update the parameters of the model online after each new state-action.

act(obs)[source]

Execute the behavior policy–an $$\epsilon$$-soft policy used to generate actions during training.

Parameters: obs (int, float, or ndarray as returned by env.step(action)) – An observation from the environment. action (int, float, or ndarray) – An action sampled from the distribution over actions defined by the epsilon-soft policy.
greedy_policy(max_steps, render=True)[source]

Execute a deterministic greedy policy using the current agent parameters.

Parameters: max_steps (int) – The maximum number of steps to run the episode. render (bool) – Whether to render the episode during execution. total_reward (float) – The total reward on the episode. n_steps (float) – The total number of steps taken on the episode.
flush_history()[source]

Clear the episode history