Policies¶
BanditPolicyBase¶

class numpy_ml.bandits.policies.BanditPolicyBase[source]¶

A simple base class for multi-armed bandit policies.
EpsilonGreedy¶

class numpy_ml.bandits.policies.EpsilonGreedy(epsilon=0.05, ev_prior=0.5)[source]¶

Bases: numpy_ml.bandits.policies.BanditPolicyBase
An epsilon-greedy policy for multi-armed bandit problems.
Notes
Epsilon-greedy policies greedily select the arm with the highest expected payoff with probability \(1-\epsilon\), and select an arm uniformly at random with probability \(\epsilon\):
\[\begin{split}P(a) = \left\{ \begin{array}{lr} \epsilon / N + (1 - \epsilon) &\text{if } a = \arg \max_{a' \in \mathcal{A}} \mathbb{E}_{q_{\hat{\theta}}}[r \mid a']\\ \epsilon / N &\text{otherwise} \end{array} \right.\end{split}\]where \(N = |\mathcal{A}|\) is the number of arms, \(q_{\hat{\theta}}\) is the estimate of the arm payoff distribution under current model parameters \(\hat{\theta}\), and \(\mathbb{E}_{q_{\hat{\theta}}}[r \mid a']\) is the expected reward under \(q_{\hat{\theta}}\) of receiving reward r after taking action \(a'\).
Parameters:  epsilon (float in [0, 1]) – The probability of taking a random action. Default is 0.05.
 ev_prior (float) – The starting expected payoff for each arm before any data has been observed. Default is 0.5.
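The selection rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the numpy_ml implementation; the helper name `epsilon_greedy_select` is hypothetical:

```python
import numpy as np

def epsilon_greedy_select(arm_evs, epsilon=0.05, rng=None):
    """Pick an arm: uniformly at random with probability epsilon,
    otherwise greedily w.r.t. the current expected-value estimates."""
    rng = rng or np.random.default_rng(0)
    if rng.random() < epsilon:
        # explore: uniform random arm
        return int(rng.integers(len(arm_evs)))
    # exploit: arm with the highest estimated payoff
    return int(np.argmax(arm_evs))
```

With `epsilon=0` this reduces to a purely greedy policy; with `epsilon=1` it is uniform random exploration.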
UCB1¶

class numpy_ml.bandits.policies.UCB1(C=1, ev_prior=0.5)[source]¶

Bases: numpy_ml.bandits.policies.BanditPolicyBase
A UCB1 policy for multi-armed bandit problems.
Notes
The UCB1 algorithm [*] guarantees that the cumulative regret is bounded by log t, where t is the current timestep. To make this guarantee, UCB1 assumes all arm payoffs are between 0 and 1.
Under UCB1, the upper confidence bound on the expected value for pulling arm a at timestep t is:
\[\text{UCB}(a, t) = \text{EV}_t(a) + C \sqrt{\frac{2 \log t}{N_t(a)}}\]where \(\text{EV}_t(a)\) is the average of the rewards received so far from pulling arm a, C is a free parameter controlling the “optimism” of the confidence upper bound for \(\text{UCB}(a, t)\) (for logarithmic regret bounds, C must equal 1), and \(N_t(a)\) is the number of times arm a has been pulled during the previous t - 1 timesteps.
References
[*] Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2).
Parameters:  C (float in (0, +infinity)) – A confidence/optimism parameter affecting the degree of exploration, where larger values encourage greater exploration. The UCB1 algorithm assumes C=1. Default is 1.
 ev_prior (float) – The starting expected value for each arm before any data has been observed. Default is 0.5.
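The UCB score above can be sketched directly in NumPy. This is an illustrative helper (the name `ucb1_select` is not part of the numpy_ml API); arms that have never been pulled receive an infinite bonus and are therefore tried first:

```python
import numpy as np

def ucb1_select(ev, counts, t, C=1.0):
    """Return the arm maximizing EV_t(a) + C * sqrt(2 log t / N_t(a)).

    ev     : per-arm average reward so far
    counts : per-arm pull counts, N_t(a)
    t      : current timestep (t >= 2 so log t > 0)
    """
    ev = np.asarray(ev, dtype=float)
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        # counts == 0 yields an infinite bonus, forcing initial exploration
        bonus = C * np.sqrt(2 * np.log(t) / counts)
    return int(np.argmax(ev + bonus))
```

Note how ties in the exploration bonus fall back to the empirical averages, while rarely-pulled arms are favored even when their averages are lower.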
ThompsonSamplingBetaBinomial¶

class numpy_ml.bandits.policies.ThompsonSamplingBetaBinomial(alpha=1, beta=1)[source]¶

Bases: numpy_ml.bandits.policies.BanditPolicyBase
A conjugate Thompson sampling [1] [2] policy for multi-armed bandits with Bernoulli likelihoods.
Notes
The policy assumes independent Beta priors on the Bernoulli arm payoff probabilities, \(\theta\):
\[\begin{split}\theta_k \sim \text{Beta}(\alpha_k, \beta_k) \\ r \mid \theta_k \sim \text{Bernoulli}(\theta_k)\end{split}\]where \(k \in \{1,\ldots,K \}\) indexes arms in the MAB and \(\theta_k\) is the parameter of the Bernoulli likelihood for arm k. The sampler begins by selecting an arm with probability proportional to its payoff probability under the initial Beta prior. After pulling the sampled arm and receiving a reward, r, the sampler computes the posterior over the model parameters (arm payoffs) via Bayes’ rule, and then samples a new action in proportion to its payoff probability under this posterior. This process (i.e., sample action from posterior, take action and receive reward, compute updated posterior) is repeated until the number of trials is exhausted.
Note that due to the conjugacy between the Beta prior and the Bernoulli likelihood, the posterior for each arm will also be Beta-distributed and can be computed and sampled from efficiently:
\[\theta_k \mid r \sim \text{Beta}(\alpha_k + r, \beta_k + 1 - r)\]
References
[1] Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285-294.
[2] Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24, 2249-2257.
Parameters:  alpha (float or list of length K) – Parameter for the Beta prior on arm payouts. If a float, this value will be used in the prior for all of the K arms.
 beta (float or list of length K) – Parameter for the Beta prior on arm payouts. If a float, this value will be used in the prior for all of the K arms.
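One round of the sample-act-update loop described above can be sketched as follows. This is a simplified illustration of Beta-Bernoulli Thompson sampling, not the numpy_ml implementation; `thompson_step` and `pull_arm` are hypothetical names:

```python
import numpy as np

def thompson_step(alphas, betas, pull_arm, rng=None):
    """One round of Beta-Bernoulli Thompson sampling.

    alphas, betas : per-arm Beta posterior parameters (updated in place)
    pull_arm      : callable mapping an arm index to a reward in {0, 1}
    """
    rng = rng or np.random.default_rng()
    theta = rng.beta(alphas, betas)   # sample a payoff estimate for each arm
    a = int(np.argmax(theta))         # act greedily w.r.t. the sampled estimates
    r = pull_arm(a)
    alphas[a] += r                    # conjugate update: Beta(alpha + r, beta + 1 - r)
    betas[a] += 1 - r
    return a, r
```

Because arms are chosen via posterior samples rather than posterior means, exploration falls out automatically: uncertain arms occasionally draw high samples and get pulled.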
LinUCB¶

class numpy_ml.bandits.policies.LinUCB(alpha=1)[source]¶

Bases: numpy_ml.bandits.policies.BanditPolicyBase
A disjoint linear UCB policy [†] for contextual linear bandits.
Notes
LinUCB is only defined for ContextualLinearBandit environments.
References
[†] Li, L., Chu, W., Langford, J., & Schapire, R. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661-670. Parameters: alpha (float) – A confidence/optimism parameter affecting the amount of exploration. Default is 1.
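The disjoint LinUCB scoring step from Li et al. (2010) can be sketched as below. This is a simplified, assumption-laden illustration (the helper name `linucb_select` and the argument layout are hypothetical, not the numpy_ml API): each arm a keeps a design matrix \(A_a = X^\top X + I\) and response vector \(b_a = X^\top r\), and the policy scores each arm by its ridge-regression payoff estimate plus an alpha-scaled confidence width in the current context x:

```python
import numpy as np

def linucb_select(A_list, b_list, x, alpha=1.0):
    """Disjoint LinUCB arm selection for a single context vector.

    A_list : per-arm D x D design matrices (X^T X + I)
    b_list : per-arm length-D response vectors (X^T r)
    x      : length-D context vector for the current round
    """
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b                                # ridge estimate of arm coefficients
        p = theta @ x + alpha * np.sqrt(x @ A_inv @ x)   # payoff estimate + confidence width
        scores.append(p)
    return int(np.argmax(scores))
```

Larger alpha widens the confidence term and so favors arms whose payoff in the current context is still uncertain, mirroring the role of C in UCB1.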