# Policies

## BanditPolicyBase

class numpy_ml.bandits.policies.BanditPolicyBase[source]

A simple base class for multi-armed bandit policies

hyperparameters[source]

A dictionary containing the policy hyperparameters

parameters[source]

A dictionary containing the current policy parameters

act(bandit, context=None)[source]

Select an arm and sample from its payoff distribution.

Parameters:

- bandit (Bandit instance) – The multi-armed bandit to act upon.
- context (ndarray of shape (D,) or None) – The context vector for the current timestep if interacting with a contextual bandit; otherwise, this argument is unused. Default is None.

Returns:

- rwd (float) – The reward received after pulling arm_id.
- arm_id (int) – The arm that was pulled to generate rwd.
reset()[source]

Reset the policy parameters and counters to their initial states.
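As a minimal sketch of how the act/reset interface above is typically driven, the following pairs a toy Bernoulli bandit with a uniformly random policy. Both `ToyBernoulliBandit` and `RandomPolicy` are illustrative stand-ins for this example, not classes in numpy_ml:

```python
import numpy as np

class ToyBernoulliBandit:
    """Illustrative bandit: each arm pays 1 with a fixed probability."""
    def __init__(self, probs, seed=0):
        self.probs = list(probs)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm_id):
        # Reward is 1 with the arm's payoff probability, else 0
        return int(self.rng.random() < self.probs[arm_id])

class RandomPolicy:
    """Illustrative policy with the same act()/reset() shape as above."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.parameters = {}

    def act(self, bandit, context=None):
        # Select an arm uniformly at random and sample its payoff
        arm_id = int(self.rng.integers(len(bandit.probs)))
        rwd = bandit.pull(arm_id)
        return rwd, arm_id

    def reset(self):
        self.parameters = {}

bandit = ToyBernoulliBandit([0.2, 0.8])
policy = RandomPolicy()
rewards = [policy.act(bandit)[0] for _ in range(100)]
```

Real policies replace the uniform choice in `act` with their own selection rule but keep the same interaction loop.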

## EpsilonGreedy

class numpy_ml.bandits.policies.EpsilonGreedy(epsilon=0.05, ev_prior=0.5)[source]

An epsilon-greedy policy for multi-armed bandit problems.

Notes

Epsilon-greedy policies greedily select the arm with the highest expected payoff with probability $$1-\epsilon$$ and select an arm uniformly at random with probability $$\epsilon$$:

$\begin{split}P(a) = \left\{ \begin{array}{lr} \epsilon / N + (1 - \epsilon) &\text{if } a = \arg \max_{a' \in \mathcal{A}} \mathbb{E}_{q_{\hat{\theta}}}[r \mid a']\\ \epsilon / N &\text{otherwise} \end{array} \right.\end{split}$

where $$N = |\mathcal{A}|$$ is the number of arms, $$q_{\hat{\theta}}$$ is the estimate of the arm payoff distribution under current model parameters $$\hat{\theta}$$, and $$\mathbb{E}_{q_{\hat{\theta}}}[r \mid a']$$ is the expected reward under $$q_{\hat{\theta}}$$ of receiving reward r after taking action $$a'$$.
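The selection probabilities in the formula above can be sketched in a few lines of NumPy. Note that `epsilon_greedy_probs` is a hypothetical helper for illustration, not part of numpy_ml's API:

```python
import numpy as np

def epsilon_greedy_probs(evs, epsilon=0.05):
    """Selection probability P(a) for each arm, given the current
    expected-value estimates `evs` (one per arm)."""
    evs = np.asarray(evs, dtype=float)
    N = len(evs)
    # Every arm gets the uniform-exploration mass epsilon / N ...
    probs = np.full(N, epsilon / N)
    # ... and the greedy arm additionally gets the 1 - epsilon mass
    probs[np.argmax(evs)] += 1 - epsilon
    return probs

probs = epsilon_greedy_probs([0.1, 0.5, 0.3], epsilon=0.1)
```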

Parameters:

- epsilon (float in [0, 1]) – The probability of taking a random action. Default is 0.05.
- ev_prior (float) – The starting expected payoff for each arm before any data has been observed. Default is 0.5.
parameters[source]

A dictionary containing the current policy parameters

hyperparameters[source]

A dictionary containing the policy hyperparameters

## UCB1

class numpy_ml.bandits.policies.UCB1(C=1, ev_prior=0.5)[source]

A UCB1 policy for multi-armed bandit problems.

Notes

The UCB1 algorithm [*] guarantees that the cumulative regret grows at most logarithmically in t, the current timestep. To make this guarantee, UCB1 assumes all arm payoffs lie between 0 and 1.

Under UCB1, the upper confidence bound on the expected value for pulling arm a at timestep t is:

$\text{UCB}(a, t) = \text{EV}_t(a) + C \sqrt{\frac{2 \log t}{N_t(a)}}$

where $$\text{EV}_t(a)$$ is the average of the rewards received so far from pulling arm a, C is a free parameter controlling the “optimism” of the upper confidence bound $$\text{UCB}(a, t)$$ (for the logarithmic regret bound to hold, C must equal 1), and $$N_t(a)$$ is the number of times arm a has been pulled during the previous t - 1 timesteps.
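The bound above is straightforward to compute for all arms at once. A sketch (`ucb1_scores` is an illustrative helper, not numpy_ml's implementation; it assumes every arm has already been pulled at least once so that $$N_t(a) > 0$$):

```python
import numpy as np

def ucb1_scores(evs, counts, t, C=1.0):
    """UCB(a, t) = EV_t(a) + C * sqrt(2 * log(t) / N_t(a)) for each arm.

    evs    -- average reward observed so far for each arm
    counts -- number of times each arm has been pulled (all > 0)
    t      -- current timestep
    """
    evs = np.asarray(evs, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return evs + C * np.sqrt(2 * np.log(t) / counts)

# Two arms with identical empirical means: the exploration bonus
# favors the arm that has been pulled less often.
scores = ucb1_scores([0.5, 0.5], [10, 1], t=11)
```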

References

 [*] Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2).
Parameters:

- C (float in (0, +infinity)) – A confidence/optimism parameter affecting the degree of exploration, where larger values encourage greater exploration. The UCB1 algorithm assumes C=1. Default is 1.
- ev_prior (float) – The starting expected value for each arm before any data has been observed. Default is 0.5.
parameters[source]

A dictionary containing the current policy parameters

hyperparameters[source]

A dictionary containing the policy hyperparameters

## ThompsonSamplingBetaBinomial

class numpy_ml.bandits.policies.ThompsonSamplingBetaBinomial(alpha=1, beta=1)[source]

A conjugate Thompson sampling policy for multi-armed bandits with Bernoulli likelihoods.

Notes

The policy assumes independent Beta priors on the Bernoulli arm payoff probabilities, $$\theta$$:

$\begin{split}\theta_k \sim \text{Beta}(\alpha_k, \beta_k) \\ r \mid \theta_k \sim \text{Bernoulli}(\theta_k)\end{split}$

where $$k \in \{1,\ldots,K \}$$ indexes arms in the MAB and $$\theta_k$$ is the parameter of the Bernoulli likelihood for arm k. The sampler begins by selecting an arm with probability proportional to its payoff probability under the initial Beta prior. After pulling the sampled arm and receiving a reward, r, the sampler computes the posterior over the model parameters (arm payoffs) via Bayes’ rule, and then samples a new action in proportion to its payoff probability under this posterior. This process (i.e., sample action from posterior, take action and receive reward, compute updated posterior) is repeated until the number of trials is exhausted.

Note that due to the conjugacy between the Beta prior and the Bernoulli likelihood, the posterior for each arm is also Beta-distributed and can be computed and sampled from efficiently:

$\theta_k \mid r \sim \text{Beta}(\alpha_k + r, \beta_k + 1 - r)$
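One round of the sample–pull–update loop described above, using the conjugate Beta update, can be sketched as follows. `thompson_step` and the `bandit_pull` callback are illustrative names for this example, not numpy_ml API:

```python
import numpy as np

def thompson_step(alphas, betas, bandit_pull, rng):
    """One Thompson-sampling round over K arms.

    Sample theta_k ~ Beta(alpha_k, beta_k) for every arm, pull the arm
    with the largest sample, then apply the conjugate posterior update
    Beta(alpha_k + r, beta_k + 1 - r) in place.
    """
    thetas = rng.beta(alphas, betas)   # one posterior draw per arm
    arm = int(np.argmax(thetas))       # act greedily w.r.t. the draw
    r = bandit_pull(arm)               # observed Bernoulli reward (0 or 1)
    alphas[arm] += r
    betas[arm] += 1 - r
    return arm, r

# Toy run: two Bernoulli arms with payoff probabilities 0.2 and 0.8
rng = np.random.default_rng(0)
alphas, betas = np.ones(2), np.ones(2)   # uniform Beta(1, 1) priors
probs = [0.2, 0.8]
pull = lambda a: int(rng.random() < probs[a])
for _ in range(500):
    thompson_step(alphas, betas, pull, rng)
```

After enough rounds the posterior mass, and hence the pull counts, concentrate on the higher-payoff arm.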

References

  Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285-294.
  Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24, 2249-2257.
Parameters:

- alpha (float or list of length K) – Parameter for the Beta prior on arm payoffs. If a float, this value will be used in the prior for all of the K arms. Default is 1.
- beta (float or list of length K) – Parameter for the Beta prior on arm payoffs. If a float, this value will be used in the prior for all of the K arms. Default is 1.
parameters[source]

A dictionary containing the current policy parameters

hyperparameters[source]

A dictionary containing the policy hyperparameters

## LinUCB

class numpy_ml.bandits.policies.LinUCB(alpha=1)[source]

A disjoint linear UCB policy [†] for contextual linear bandits.

Notes

LinUCB is only defined for ContextualLinearBandit environments.

References

 [†] Li, L., Chu, W., Langford, J., & Schapire, R. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661-670.
Parameters:

- alpha (float) – A confidence/optimism parameter affecting the amount of exploration. Default is 1.
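In the disjoint model of Li et al. [†], each arm a maintains its own ridge-regression statistics $$A_a = I + \sum x x^\top$$ and $$b_a = \sum r x$$ over the contexts and rewards observed when it was pulled, and is scored as $$p_a = \hat{\theta}_a^\top x + \alpha \sqrt{x^\top A_a^{-1} x}$$ with $$\hat{\theta}_a = A_a^{-1} b_a$$. A minimal sketch of that scoring step (`linucb_scores` is a hypothetical helper, not the library's implementation):

```python
import numpy as np

def linucb_scores(A_list, b_list, x, alpha=1.0):
    """Disjoint LinUCB score p_a for each arm, given context x of shape (D,).

    A_list -- per-arm design matrices A_a = I + sum of x x^T (shape (D, D))
    b_list -- per-arm response vectors b_a = sum of r * x (shape (D,))
    """
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b                       # ridge-regression estimate
        bonus = alpha * np.sqrt(x @ A_inv @ x)  # exploration bonus
        scores.append(theta @ x + bonus)
    return np.array(scores)

# Toy check in D = 2: arm 0 has seen reward 1 for context [1, 0],
# arm 1 has seen nothing; both start from A = I.
D = 2
A_list = [np.eye(D), np.eye(D)]
b_list = [np.array([1.0, 0.0]), np.zeros(D)]
scores = linucb_scores(A_list, b_list, np.array([1.0, 0.0]), alpha=1.0)
```

The policy pulls the arm with the largest score, then folds the observed (context, reward) pair into that arm's $$A_a$$ and $$b_a$$.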
parameters[source]

A dictionary containing the current policy parameters

hyperparameters[source]

A dictionary containing the policy hyperparameters