Trainer¶

BanditTrainer¶

class numpy_ml.bandits.trainer.BanditTrainer[source]

An object to facilitate multi-armed bandit training, comparison, and evaluation.

compare(policies, bandit, n_trials, n_duplicates, plot=True, seed=None, smooth_weight=0.999, out_dir=None)[source]

Compare the performance of multiple policies on the same bandit environment, generating a plot for each.

Parameters:
    policies (list of BanditPolicyBase instances) – The multi-armed bandit policies to compare.
    bandit (Bandit instance) – The environment to train the policies on.
    n_trials (int) – The number of trials per run.
    n_duplicates (int) – The number of times to evaluate each policy on the bandit environment. Larger values permit a better estimate of the variance in payoff / cumulative regret for each policy.
    plot (bool) – Whether to generate a plot of each policy’s average reward and regret across the episodes. Default is True.
    seed (int) – The seed for the random number generator. Default is None.
    smooth_weight (float in [0, 1]) – The smoothing weight. Values closer to 0 result in less smoothing; values closer to 1 produce more aggressive smoothing. Default is 0.999.
    out_dir (str or None) – Plots will be saved to this directory if plot is True. If out_dir is None, plots will not be saved. Default is None.
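The smooth_weight parameter controls an exponential moving average applied to the plotted reward/regret curves. A minimal sketch of that kind of smoothing (the smooth helper below is illustrative, not part of the library's public API):

```python
import numpy as np

def smooth(values, weight):
    """Exponentially smooth a reward or regret curve.

    `weight` plays the role of `smooth_weight`: values near 0 leave the
    curve almost untouched, values near 1 smooth aggressively.
    """
    values = np.asarray(values, dtype=float)
    smoothed = np.zeros_like(values)
    smoothed[0] = values[0]
    for t in range(1, len(values)):
        # blend the previous smoothed value with the new observation
        smoothed[t] = weight * smoothed[t - 1] + (1 - weight) * values[t]
    return smoothed
```

With weight=0 the output equals the input; with the default 0.999, each new observation shifts the curve by only 0.1%, which is why such an aggressive default still tracks long-run trends over thousands of trials.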
train(policy, bandit, n_trials, n_duplicates, plot=True, axes=None, verbose=True, print_every=100, smooth_weight=0.999, out_dir=None)[source]

Train a MAB policy on a multi-armed bandit problem, logging training statistics along the way.

Parameters:
    policy (BanditPolicyBase instance) – The multi-armed bandit policy to train.
    bandit (Bandit instance) – The environment to run the policy on.
    n_trials (int) – The number of trials per run.
    n_duplicates (int) – The number of times to run the policy on the bandit environment.
    plot (bool) – Whether to generate a plot of the policy’s average reward and regret across the episodes. Default is True.
    axes (list of Axis instances or None) – If not None and plot is True, these are the axes that will be used to plot the cumulative reward and regret, respectively. Default is None.
    verbose (boolean) – Whether to print run statistics during training. Default is True.
    print_every (int) – The number of episodes to run before printing loss values to stdout. This is ignored if verbose is False. Default is 100.
    smooth_weight (float in [0, 1]) – The smoothing weight. Values closer to 0 result in less smoothing; values closer to 1 produce more aggressive smoothing. Default is 0.999.
    out_dir (str or None) – Plots will be saved to this directory if plot is True. If out_dir is None, plots will not be saved. Default is None.

Returns:
    policy (BanditPolicyBase instance) – The policy trained during the last (i.e., most recent) duplicate run.
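To make the logged quantities concrete, here is a self-contained toy version of a single duplicate run: an epsilon-greedy policy on a Bernoulli bandit, recording per-trial reward and expected regret. This is a sketch only; the real trainer delegates arm selection and value updates to the policy object, and the function and parameter names below are illustrative, not library API.

```python
import numpy as np

def run_trials(payoff_probs, n_trials, epsilon=0.1, seed=0):
    """Toy single-duplicate run: epsilon-greedy on a Bernoulli bandit."""
    rng = np.random.default_rng(seed)
    n_arms = len(payoff_probs)
    counts = np.zeros(n_arms)
    ev_est = np.zeros(n_arms)       # estimated expected value per arm
    best_ev = max(payoff_probs)     # expected reward of the optimal arm
    rewards, regrets = [], []
    for _ in range(n_trials):
        # explore with probability epsilon, otherwise exploit
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))
        else:
            arm = int(np.argmax(ev_est))
        r = float(rng.random() < payoff_probs[arm])     # Bernoulli payout
        counts[arm] += 1
        ev_est[arm] += (r - ev_est[arm]) / counts[arm]  # running-mean update
        rewards.append(r)
        regrets.append(best_ev - payoff_probs[arm])     # expected regret
    return rewards, regrets
```

The `mse` field in the trainer's logs would correspond here to the squared error between ev_est and payoff_probs; `cregret` is simply the cumulative sum of the regret series.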
init_logs(policies)[source]

Initialize the episode logs.

Notes

Training logs are represented as a nested set of dictionaries with the following structure:

log[model_id][metric][trial_number][duplicate_number]

For example, logs['model1']['regret'][3][1] holds the regret value accrued on the 3rd trial of the 2nd duplicate run for model1.

Available fields are ‘regret’, ‘cregret’ (cumulative regret), ‘reward’, ‘mse’ (mean-squared error between estimated arm EVs and the true EVs), ‘optimal_arm’, ‘selected_arm’, and ‘optimal_reward’.
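The nested layout above can be sketched with plain dictionaries. The helper below mirrors the documented log[model_id][metric][trial_number][duplicate_number] structure; the actual implementation may organize its internals differently.

```python
from collections import defaultdict

METRICS = ["regret", "cregret", "reward", "mse",
           "optimal_arm", "selected_arm", "optimal_reward"]

def make_logs(model_ids):
    """Build an empty log: log[model_id][metric][trial][duplicate]."""
    # defaultdict(dict) lets trial entries be created lazily on first write
    return {mid: {m: defaultdict(dict) for m in METRICS}
            for mid in model_ids}

logs = make_logs(["model1"])
# regret accrued on the 3rd trial of the 2nd duplicate run for model1
logs["model1"]["regret"][3][1] = 0.25
```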