# Bandit Environments

## Bandit

class numpy_ml.bandits.bandits.Bandit(rewards, reward_probs, context=None)[source]
hyperparameters[source]

A dictionary of the bandit hyperparameters

oracle_payoff(context=None)[source]

Return the expected reward for an optimal agent.

Parameters:
- context (ndarray of shape (D, K) or None) – The current context matrix for each of the bandit arms, if applicable. Default is None.

Returns:
- optimal_rwd (float) – The expected reward under an optimal policy.
pull(arm_id, context=None)[source]

“Pull” (i.e., sample from) a given arm’s payoff distribution.

Parameters:
- arm_id (int) – The integer ID of the arm to sample from.
- context (ndarray of shape (D,) or None) – The context vector for the current timestep if this is a contextual bandit. Otherwise, this argument is unused. Default is None.

Returns:
- reward (float) – The reward sampled from the given arm’s payoff distribution.
reset()[source]

Reset the bandit step and action counters to zero.
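The shared `pull`/`reset`/`oracle_payoff` interface can be illustrated with a self-contained toy stand-in. `TinyBernoulliBandit` and its internals here are hypothetical and written with the standard library only; they are not the library's actual implementation:

```python
import random

class TinyBernoulliBandit:
    """Toy stand-in for the shared Bandit interface (illustrative only)."""

    def __init__(self, payoff_probs):
        self.payoff_probs = payoff_probs
        self.step = 0                              # pull counter, cleared by reset()
        self.arm_counts = [0] * len(payoff_probs)  # per-arm action counter

    def pull(self, arm_id, context=None):
        # sample a 0/1 reward from the chosen arm's Bernoulli distribution
        self.step += 1
        self.arm_counts[arm_id] += 1
        return 1.0 if random.random() < self.payoff_probs[arm_id] else 0.0

    def oracle_payoff(self, context=None):
        # an optimal agent always plays the highest-probability arm
        return max(self.payoff_probs)

    def reset(self):
        # reset the step and action counters to zero
        self.step = 0
        self.arm_counts = [0] * len(self.payoff_probs)

random.seed(0)
bandit = TinyBernoulliBandit([0.3, 0.8])
rewards = [bandit.pull(1) for _ in range(5)]  # repeatedly sample arm 1
bandit.reset()
```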

## MultinomialBandit

class numpy_ml.bandits.MultinomialBandit(payoffs, payoff_probs)[source]

A multi-armed bandit where each arm is associated with a different multinomial payoff distribution.

Parameters:
- payoffs (ragged list of length K) – The payoff values for each of the K arms. payoffs[k][i] holds the i-th payoff value for arm k.
- payoff_probs (ragged list of length K) – A list of the probabilities associated with each of the payoff values in payoffs. payoff_probs[k][i] holds the probability of payoff index i for arm k.
hyperparameters[source]

A dictionary of the bandit hyperparameters

oracle_payoff(context=None)[source]

Return the expected reward for an optimal agent.

Parameters:
- context (ndarray of shape (D, K) or None) – Unused. Default is None.

Returns:
- optimal_rwd (float) – The expected reward under an optimal policy.
- optimal_arm (int) – The ID of the arm with the largest expected reward.
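The ragged `payoffs`/`payoff_probs` format can be sketched with pure Python (the numbers below are illustrative). The expected payoff of arm k is the probability-weighted sum of its payoff values, and the oracle picks the arm that maximizes it:

```python
# Ragged payoff spec in the format MultinomialBandit expects:
# payoffs[k][i] is the i-th payoff value for arm k,
# payoff_probs[k][i] its probability (illustrative numbers).
payoffs      = [[1.0, 5.0], [2.0, 2.5, 10.0]]
payoff_probs = [[0.9, 0.1], [0.5, 0.4, 0.1]]

# Expected payoff of arm k: sum_i payoffs[k][i] * payoff_probs[k][i]
expected = [
    sum(v * p for v, p in zip(vals, probs))
    for vals, probs in zip(payoffs, payoff_probs)
]
best_arm = max(range(len(expected)), key=expected.__getitem__)
```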

## BernoulliBandit

class numpy_ml.bandits.BernoulliBandit(payoff_probs)[source]

A multi-armed bandit where each arm is associated with an independent Bernoulli payoff distribution.

Parameters:
- payoff_probs (list of length K) – A list of the payoff probabilities for each arm. payoff_probs[k] holds the probability of payoff for arm k.
hyperparameters[source]

A dictionary of the bandit hyperparameters

oracle_payoff(context=None)[source]

Return the expected reward for an optimal agent.

Parameters:
- context (ndarray of shape (D, K) or None) – Unused. Default is None.

Returns:
- optimal_rwd (float) – The expected reward under an optimal policy.
- optimal_arm (int) – The ID of the arm with the largest expected reward.
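For a Bernoulli bandit, `oracle_payoff` is simply the largest arm probability, which makes expected regret easy to compute by hand. A small arithmetic sketch with illustrative probabilities:

```python
payoff_probs = [0.2, 0.5, 0.75]  # illustrative per-arm payoff probabilities

# What oracle_payoff reports: the payoff probability of the best arm.
optimal_rwd = max(payoff_probs)
optimal_arm = payoff_probs.index(optimal_rwd)

# Expected cumulative regret of a uniformly random policy over T steps:
# the oracle earns optimal_rwd per step in expectation, while a random
# policy earns the mean arm probability.
T = 1000
mean_arm = sum(payoff_probs) / len(payoff_probs)
expected_regret = T * (optimal_rwd - mean_arm)
```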

## GaussianBandit

class numpy_ml.bandits.GaussianBandit(payoff_dists, payoff_probs)[source]

A multi-armed bandit that is similar to BernoulliBandit, but instead of each arm having a fixed payout of 1, the payoff values are sampled from independent Gaussian random variables.

Parameters:
- payoff_dists (list of 2-tuples of length K) – The parameters of the payoff distributions for each of the K arms. Specifically, payoff_dists[k] is a tuple of (mean, variance) for the Gaussian distribution over payoffs associated with arm k.
- payoff_probs (list of length K) – A list of the probabilities associated with each of the payoff values. payoff_probs[k] holds the probability of payoff for arm k.
hyperparameters[source]

A dictionary of the bandit hyperparameters

oracle_payoff(context=None)[source]

Return the expected reward for an optimal agent.

Parameters:
- context (ndarray of shape (D, K) or None) – Unused. Default is None.

Returns:
- optimal_rwd (float) – The expected reward under an optimal policy.
- optimal_arm (int) – The ID of the arm with the largest expected reward.
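Sampling from a `(mean, variance)` payoff spec can be sketched with the standard library. Note that `random.gauss` takes a standard deviation, so the variance must be square-rooted; the distribution parameters below are illustrative:

```python
import random

# payoff_dists in the format GaussianBandit expects: a (mean, variance)
# tuple per arm (illustrative numbers)
payoff_dists = [(0.0, 1.0), (2.0, 0.25)]

def sample_payoff(k):
    mean, var = payoff_dists[k]
    # random.gauss takes a standard deviation, so take the square root
    return random.gauss(mean, var ** 0.5)

random.seed(0)
draws = [sample_payoff(1) for _ in range(1000)]
empirical_mean = sum(draws) / len(draws)  # should be close to the arm's mean, 2.0
```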

## ShortestPathBandit

class numpy_ml.bandits.ShortestPathBandit(G, start_vertex, end_vertex)[source]

A weighted graph shortest path problem formulated as a multi-armed bandit.

Notes

Each arm corresponds to a valid path through the graph from start to end vertex. The agent’s goal is to find the path that minimizes the expected sum of the weights on the edges it traverses.

Parameters:
- G (Graph instance) – A weighted graph object. Weights can be fixed or probabilistic.
- start_vertex (int) – The index of the path’s start vertex in the graph.
- end_vertex (int) – The index of the path’s end vertex in the graph.
hyperparameters[source]

A dictionary of the bandit hyperparameters

oracle_payoff(context=None)[source]

Return the expected reward for an optimal agent.

Parameters:
- context (ndarray of shape (D, K) or None) – Unused. Default is None.

Returns:
- optimal_rwd (float) – The expected reward under an optimal policy.
- optimal_arm (int) – The ID of the arm with the largest expected reward.
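The "each arm is a path" formulation can be sketched without the library's Graph class, using a plain adjacency dict with fixed edge weights (the graph below is illustrative). Each simple path from start to end becomes one arm, and the oracle is the minimum-cost path:

```python
# Toy weighted digraph as an adjacency dict (illustrative, fixed weights).
G = {0: {1: 1.0, 2: 4.0}, 1: {2: 1.0, 3: 5.0}, 2: {3: 1.0}, 3: {}}

def all_paths(g, u, goal, path=None):
    # enumerate simple paths from u to goal; each one becomes a bandit "arm"
    path = (path or []) + [u]
    if u == goal:
        yield path
        return
    for v in g[u]:
        if v not in path:
            yield from all_paths(g, v, goal, path)

def path_cost(g, path):
    # sum of edge weights along the path
    return sum(g[u][v] for u, v in zip(path, path[1:]))

arms = list(all_paths(G, 0, 3))
costs = [path_cost(G, p) for p in arms]
best = arms[min(range(len(costs)), key=costs.__getitem__)]  # oracle's path
```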

## ContextualBernoulliBandit

class numpy_ml.bandits.ContextualBernoulliBandit(context_probs)[source]

A contextual version of BernoulliBandit where each binary context feature is associated with an independent Bernoulli payoff distribution.

Parameters:
- context_probs (ndarray of shape (D, K)) – A matrix of the payoff probabilities associated with each of the D context features, for each of the K arms. Index (i, j) contains the probability of payoff for arm j under context i.
hyperparameters[source]

A dictionary of the bandit hyperparameters

get_context()[source]

Sample a random one-hot context vector. This vector will be the same for all arms.

Returns:
- context (ndarray of shape (D, K)) – A random D-dimensional one-hot context vector repeated for each of the K bandit arms.
oracle_payoff(context)[source]

Return the expected reward for an optimal agent.

Parameters:
- context (ndarray of shape (D, K)) – The current context matrix for each of the bandit arms.

Returns:
- optimal_rwd (float) – The expected reward under an optimal policy.
- optimal_arm (int) – The ID of the arm with the largest expected reward.
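The one-hot context and its oracle can be sketched with plain lists (the `context_probs` matrix below is illustrative, not from the library). With a one-hot context on feature i, arm j pays off with probability `context_probs[i][j]`, so the oracle picks the argmax of row i:

```python
import random

# context_probs[i][j]: payoff probability for arm j under context feature i
# (illustrative 3-feature, 2-arm matrix)
context_probs = [[0.1, 0.9],
                 [0.5, 0.5],
                 [0.8, 0.2]]
D, K = 3, 2

random.seed(1)
i = random.randrange(D)  # index of the active context feature

# One-hot (D, K) context matrix: the same one-hot column repeated for each arm.
context = [[1.0 if d == i else 0.0] * K for d in range(D)]

# Under a one-hot context on feature i, arm j's payoff probability is
# context_probs[i][j]; the oracle picks the argmax over arms.
arm_probs = context_probs[i]
optimal_arm = max(range(K), key=arm_probs.__getitem__)
optimal_rwd = arm_probs[optimal_arm]
```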

## ContextualLinearBandit

class numpy_ml.bandits.ContextualLinearBandit(K, D, payoff_variance=1)[source]

A contextual linear multi-armed bandit.

Notes

In a contextual linear bandit the expected payoff of an arm $a \in \mathcal{A}$ at time t is a linear combination of its context vector $\mathbf{x}_{t,a}$ with a coefficient vector $\theta_a$:

$\mathbb{E}[r_{t, a} \mid \mathbf{x}_{t, a}] = \mathbf{x}_{t,a}^\top \theta_a$

In this implementation, the arm coefficient vectors $\theta$ are initialized independently from a uniform distribution on the interval [-1, 1], and the specific reward at timestep t is normally distributed:

$r_{t, a} \mid \mathbf{x}_{t, a} \sim \mathcal{N}(\mathbf{x}_{t,a}^\top \theta_a, \sigma_a^2)$

Parameters:
- K (int) – The number of bandit arms.
- D (int) – The dimensionality of the context vectors.
- payoff_variance (float or ndarray of shape (K,)) – The variance of the random noise in the arm payoffs. If a float, the variance is assumed to be equal for each arm. Default is 1.
hyperparameters[source]

A dictionary of the bandit hyperparameters

parameters[source]

A dictionary of the current bandit parameters

get_context()[source]

Sample the context vectors for each arm from a multivariate standard normal distribution.

Returns:
- context (ndarray of shape (D, K)) – A D-dimensional context vector sampled from a standard normal distribution for each of the K bandit arms.
oracle_payoff(context)[source]

Return the expected reward for an optimal agent.

Parameters:
- context (ndarray of shape (D, K)) – The current context matrix for each of the bandit arms.

Returns:
- optimal_rwd (float) – The expected reward under an optimal policy.
- optimal_arm (int) – The ID of the arm with the largest expected reward.
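The linear payoff model above can be sketched end to end with the standard library. This is a hypothetical re-derivation of the described behavior (Uniform(-1, 1) coefficients, standard-normal contexts, Gaussian reward noise), not the library's code:

```python
import random

random.seed(3)
K, D, payoff_variance = 2, 4, 1.0  # illustrative sizes

# Arm coefficient vectors theta_a with Uniform(-1, 1) entries, as described above.
theta = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(K)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# A standard-normal context vector per arm (column k of the (D, K) matrix).
context = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

# Expected payoff of arm a is x_{t,a}^T theta_a; the realized reward adds
# Gaussian noise with variance payoff_variance.
expected_payoffs = [dot(context[a], theta[a]) for a in range(K)]
optimal_arm = max(range(K), key=expected_payoffs.__getitem__)
reward = random.gauss(expected_payoffs[optimal_arm], payoff_variance ** 0.5)
```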