Data structures


class numpy_ml.utils.data_structures.BallTree(leaf_size=40, metric=None)[source]

A ball tree data structure.


A ball tree is a binary tree in which every node defines a D-dimensional hypersphere (“ball”) containing a subset of the points to be searched. Each internal node of the tree partitions the data points into two disjoint sets which are associated with different balls. While the balls themselves may intersect, each point is assigned to one or the other ball in the partition according to its distance from the ball’s center. Each leaf node in the tree defines a ball and enumerates all data points inside that ball.

  • leaf_size (int) – The maximum number of datapoints at each leaf. Default is 40.
  • metric (Distance metric or None) – The distance metric to use for computing nearest neighbors. If None, use the euclidean() metric. Default is None.


[1]Omohundro, S. M. (1989). “Five balltree construction algorithms”. ICSI Technical Report TR-89-063.
[2]Liu, T., Moore, A., & Gray A. (2006). “New algorithms for efficient high-dimensional nonparametric classification”. J. Mach. Learn. Res., 7, 1135-1158.
fit(X, y=None)[source]

Build a ball tree recursively using the O(M log N) k-d construction algorithm.


Recursively divides data into nodes defined by a centroid C and radius r such that each point below the node lies within the hyper-sphere defined by C and r.

  • X (ndarray of shape (N, M)) – An array of N examples each with M features.
  • y (ndarray of shape (N, *) or None) – An array of target values / labels associated with the entries in X. Default is None.
nearest_neighbors(k, x)[source]

Find the k nearest neighbors in the ball tree to a query vector x using the KNS1 algorithm.

  • k (int) – The number of closest points in X to return
  • x (ndarray of shape (1, M)) – The query vector.

nearest (list of PQNode s of length k) – List of the k points in X to closest to the query vector. The key attribute of each PQNode contains the point itself, the val attribute contains its target, and the distance attribute contains its distance to the query vector.


class numpy_ml.utils.data_structures.DiscreteSampler(probs, log=False, with_replacement=True)[source]

Sample from an arbitrary multinomial PMF over the first N nonnegative integers using Vose’s algorithm for the alias method.


Vose’s algorithm takes O(n) time to initialize, requires O(n) memory, and generates samples in constant time.


[1]Walker, A. J. (1977) “An efficient method for generating discrete random variables with general distributions”. ACM Transactions on Mathematical Software, 3(3), 253-256.
[2]Vose, M. D. (1991) “A linear algorithm for generating random numbers with a given distribution”. IEEE Trans. Softw. Eng., 9, 972-974.
[3]Schwarz, K (2011) “Darts, dice, and coins: sampling from a discrete distribution”.
  • probs (ndarray of length (N,)) – A list of probabilities of the N outcomes in the sample space. probs[i] returns the probability of outcome i.
  • log (bool) – Whether the probabilities in probs are in logspace. Default is False.
  • with_replacement (bool) – Whether to generate samples with or without replacement. Default is True.

Generate random draws from the probs distribution over integers in [0, N).

Parameters:n_samples (int) – The number of samples to generate. Default is 1.
Returns:sample (ndarray of shape (n_samples,)) – A collection of draws from the distribution defined by probs. Each sample is an int in the range [0, N).


class numpy_ml.utils.data_structures.PriorityQueue(capacity, heap_order='max')[source]

A priority queue implementation using a binary heap.


A priority queue is a data structure useful for storing the top capacity largest or smallest elements in a collection of values. As a result of using a binary heap, PriorityQueue offers O(log N) push() and pop() operations.

  • capacity (int) – The maximum number of items that can be held in the queue.
  • heap_order ({"max", "min"}) – Whether the priority queue should retain the items with the capacity smallest (heap_order = ‘min’) or capacity largest (heap_order = ‘max’) priorities.
push(key, priority, val=None)[source]

Add a new (key, value) pair with priority priority to the queue.


If the queue is at capacity and priority exceeds the priority of the item with the largest/smallest priority currently in the queue, replace the current queue item with (key, val).

  • key (hashable object) – The key to insert into the queue.
  • priority (comparable) – The priority for the key, val pair.
  • val (object) – The value associated with key. Default is None.

Remove the item with the largest/smallest (depending on self.heap_order) priority from the queue and return it.


In contrast to peek(), this operation is O(log N).

Returns:item (PQNode instance or None) – Item with the largest/smallest priority, depending on self.heap_order.

Return the item with the largest/smallest (depending on self.heap_order) priority without removing it from the queue.


In contrast to pop(), this operation is O(1).

Returns:item (PQNode instance or None) – Item with the largest/smallest priority, depending on self.heap_order.


class numpy_ml.utils.data_structures.PQNode(key, val, priority, entry_id, **kwargs)[source]

A generic node object for holding entries in PriorityQueue


Return a dictionary representation of the node’s contents


class numpy_ml.utils.data_structures.Dict(encoder=None)[source]

Bases: dict

A dictionary subclass which returns the key value if it is not in the dict.

Parameters:encoder (function or None) – A function which is applied to a key before adding / retrieving it from the dictionary. If None, the function defaults to the identity. Default is None.