Data structures¶
BallTree
¶
-
class
numpy_ml.utils.data_structures.
BallTree
(leaf_size=40, metric=None)[source]¶ A ball tree data structure.
Notes
A ball tree is a binary tree in which every node defines a D-dimensional hypersphere (“ball”) containing a subset of the points to be searched. Each internal node of the tree partitions the data points into two disjoint sets which are associated with different balls. While the balls themselves may intersect, each point is assigned to one or the other ball in the partition according to its distance from the ball’s center. Each leaf node in the tree defines a ball and enumerates all data points inside that ball.
Parameters: - leaf_size (int) – The maximum number of datapoints at each leaf. Default is 40.
- metric (Distance metric or None) – The distance metric to use for computing nearest neighbors. If
None, use the
euclidean()
metric. Default is None.
References
[1] Omohundro, S. M. (1989). “Five balltree construction algorithms”. ICSI Technical Report TR-89-063. [2] Liu, T., Moore, A., & Gray A. (2006). “New algorithms for efficient high-dimensional nonparametric classification”. J. Mach. Learn. Res., 7, 1135-1158. -
fit
(X, y=None)[source]¶ Build a ball tree recursively using the O(M log N) k-d construction algorithm.
Notes
Recursively divides data into nodes defined by a centroid C and radius r such that each point below the node lies within the hyper-sphere defined by C and r.
Parameters:
-
nearest_neighbors
(k, x)[source]¶ Find the k nearest neighbors in the ball tree to a query vector x using the KNS1 algorithm.
Parameters: Returns: nearest (list of
PQNode
s of length k) – List of the k points in X to closest to the query vector. Thekey
attribute of eachPQNode
contains the point itself, theval
attribute contains its target, and thedistance
attribute contains its distance to the query vector.
DiscreteSampler
¶
-
class
numpy_ml.utils.data_structures.
DiscreteSampler
(probs, log=False, with_replacement=True)[source]¶ Sample from an arbitrary multinomial PMF over the first N nonnegative integers using Vose’s algorithm for the alias method.
Notes
Vose’s algorithm takes O(n) time to initialize, requires O(n) memory, and generates samples in constant time.
References
[1] Walker, A. J. (1977) “An efficient method for generating discrete random variables with general distributions”. ACM Transactions on Mathematical Software, 3(3), 253-256. [2] Vose, M. D. (1991) “A linear algorithm for generating random numbers with a given distribution”. IEEE Trans. Softw. Eng., 9, 972-974. [3] Schwarz, K (2011) “Darts, dice, and coins: sampling from a discrete distribution”. http://www.keithschwarz.com/darts-dice-coins/ Parameters: - probs (
ndarray
of length (N,)) – A list of probabilities of the N outcomes in the sample space. probs[i] returns the probability of outcome i. - log (bool) – Whether the probabilities in probs are in logspace. Default is False.
- with_replacement (bool) – Whether to generate samples with or without replacement. Default is True.
-
sample
(n_samples=1)[source]¶ Generate random draws from the probs distribution over integers in [0, N).
Parameters: n_samples (int) – The number of samples to generate. Default is 1. Returns: sample ( ndarray
of shape (n_samples,)) – A collection of draws from the distribution defined by probs. Each sample is an int in the range [0, N).
- probs (
PriorityQueue
¶
-
class
numpy_ml.utils.data_structures.
PriorityQueue
(capacity, heap_order='max')[source]¶ A priority queue implementation using a binary heap.
Notes
A priority queue is a data structure useful for storing the top capacity largest or smallest elements in a collection of values. As a result of using a binary heap,
PriorityQueue
offers O(log N)push()
andpop()
operations.Parameters: - capacity (int) – The maximum number of items that can be held in the queue.
- heap_order ({"max", "min"}) – Whether the priority queue should retain the items with the capacity smallest (heap_order = ‘min’) or capacity largest (heap_order = ‘max’) priorities.
-
push
(key, priority, val=None)[source]¶ Add a new (key, value) pair with priority priority to the queue.
Notes
If the queue is at capacity and priority exceeds the priority of the item with the largest/smallest priority currently in the queue, replace the current queue item with (key, val).
Parameters: - key (hashable object) – The key to insert into the queue.
- priority (comparable) – The priority for the key, val pair.
- val (object) – The value associated with key. Default is None.
PQNode
¶
-
class
numpy_ml.utils.data_structures.
PQNode
(key, val, priority, entry_id, **kwargs)[source]¶ A generic node object for holding entries in
PriorityQueue
Dict
¶
-
class
numpy_ml.utils.data_structures.
Dict
(encoder=None)[source]¶ Bases:
dict
A dictionary subclass which returns the key value if it is not in the dict.
Parameters: encoder (function or None) – A function which is applied to a key before adding / retrieving it from the dictionary. If None, the function defaults to the identity. Default is None.