General

FeatureHasher

class numpy_ml.preprocessing.general.FeatureHasher(n_dim=256, sparse=True)

Convert a collection of features to a fixed-dimensional matrix using the hashing trick.

Notes

Uses the md5 hash.

Parameters:
  • n_dim (int) – The dimensionality of each example in the output feature matrix. Small values make hash collisions more likely, while large values increase the number of parameters that any downstream (linear) model must learn. Default is 256.
  • sparse (bool) – Whether the resulting feature matrix should be a sparse csr_matrix or dense ndarray. Default is True.
encode(examples)

Encode a collection of multi-featured examples into an n_dim-dimensional feature matrix via feature hashing.

Notes

Feature hashing works by applying a hash function to the feature names of an example and using the hash values as column indices in the resulting feature matrix. The entry in each hashed column holds that example's value for the corresponding feature. For example, given the following two input examples:

>>> examples = [
...     {"furry": 1, "quadruped": 1, "domesticated": 1},
...     {"nocturnal": 1, "quadruped": 1},
... ]

and a hypothetical hash function H mapping strings to integers in [0, 127], we have:

>>> import numpy as np
>>> feat_mat = np.zeros((2, 128))
>>> ex1_cols = [H("furry"), H("quadruped"), H("domesticated")]
>>> ex2_cols = [H("nocturnal"), H("quadruped")]
>>> feat_mat[0, ex1_cols] = 1
>>> feat_mat[1, ex2_cols] = 1

To better handle hash collisions, it is common to multiply each feature value by a sign (+1 or -1) derived from the digest of the corresponding feature name, so that colliding features tend to cancel rather than accumulate.

Parameters: examples (dict or list of dicts) – A collection of N examples, each represented as a dict where keys correspond to the feature name and values correspond to the feature value.
Returns: table (ndarray or csr_matrix of shape (N, n_dim)) – The encoded feature matrix.
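
The procedure described in the notes above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name hash_features is hypothetical, only the dense case is handled, and the sign convention (the top bit of the 128-bit MD5 digest) is an assumption chosen for the example.

```python
import hashlib

import numpy as np


def hash_features(examples, n_dim=128):
    """Hypothetical helper: encode a list of feature dicts into an (N, n_dim) array."""
    table = np.zeros((len(examples), n_dim))
    for row, example in enumerate(examples):
        for name, value in example.items():
            # Read the MD5 digest of the feature name as one large integer.
            digest = int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)
            # The digest modulo n_dim selects the output column.
            col = digest % n_dim
            # Assumed convention: the top bit of the 128-bit digest picks the sign.
            sign = 1.0 if (digest >> 127) & 1 else -1.0
            # Accumulate, so colliding features add (or cancel) in one column.
            table[row, col] += sign * value
    return table


examples = [
    {"furry": 1, "quadruped": 1, "domesticated": 1},
    {"nocturnal": 1, "quadruped": 1},
]
feat_mat = hash_features(examples)
print(feat_mat.shape)  # (2, 128)
```

Because the column depends only on the feature name, "quadruped" lands in the same column for both rows, and no vocabulary has to be stored between calls.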

OneHotEncoder

class numpy_ml.preprocessing.general.OneHotEncoder

Convert between category labels and their one-hot vector representations.

Parameters: categories (list of length C) – List of the unique category labels for the items to encode.
fit(categories)

Create mappings between columns and category labels.

Parameters: categories (list of length C) – List of the unique category labels for the items to encode.
transform(labels, categories=None)

Convert a list of labels into a one-hot encoding.

Parameters:
  • labels (list of length N) – A list of category labels.
  • categories (list of length C) – List of the unique category labels for the items to encode. Default is None.
Returns: Y (ndarray of shape (N, C)) – The one-hot encoded labels. Each row corresponds to an example, with a single 1 in the column corresponding to the respective label.

inverse_transform(Y)

Convert a one-hot encoding back into the corresponding labels.

Parameters: Y (ndarray of shape (N, C)) – One-hot encoded labels. Each row corresponds to an example, with a single 1 in the column associated with the label for that example.
Returns: labels (list of length N) – The list of category labels corresponding to the nonzero columns in Y.
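
The round trip between labels and one-hot rows can be illustrated with plain NumPy. This is a sketch of the technique rather than the class's own code; the names cat2idx and idx2cat and the sample labels are illustrative assumptions.

```python
import numpy as np

categories = ["cat", "dog", "fish"]
# Map each category label to a column index, and back again.
cat2idx = {c: i for i, c in enumerate(categories)}
idx2cat = {i: c for c, i in cat2idx.items()}

labels = ["dog", "cat", "dog", "fish"]
cols = np.array([cat2idx[label] for label in labels])

# Build the (N, C) one-hot matrix with a single 1 per row.
Y = np.zeros((len(labels), len(categories)))
Y[np.arange(len(labels)), cols] = 1

# Invert the encoding by locating the nonzero column in each row.
recovered = [idx2cat[int(i)] for i in Y.argmax(axis=1)]
```

Here recovered equals the original labels list, which is the behavior inverse_transform is documented to provide.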

Standardizer

class numpy_ml.preprocessing.general.Standardizer(with_mean=True, with_std=True)

Feature-wise standardization for vector inputs.

Notes

Due to the sensitivity of empirical mean and standard deviation calculations to extreme values, Standardizer cannot guarantee balanced feature scales in the presence of outliers. In particular, note that because outliers for each feature can have different magnitudes, the spread of the transformed data on each feature can be very different.

Similar to sklearn, Standardizer uses a biased estimator for the standard deviation: numpy.std(x, ddof=0).
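
To make the choice of estimator concrete, the following sketch compares numpy.std with ddof=0 (divide by N) against ddof=1 (divide by N - 1). The sample values are arbitrary and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Biased estimator: sum of squared deviations divided by N.
biased = np.std(x, ddof=0)
# Bessel-corrected estimator: sum of squared deviations divided by N - 1.
corrected = np.std(x, ddof=1)

print(biased)     # 2.0
print(corrected)  # slightly larger than 2.0
```

The gap between the two shrinks as N grows, so the distinction matters mostly for small training sets.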

Parameters:
  • with_mean (bool) – Whether to center samples to have zero mean during transformation. Default is True.
  • with_std (bool) – Whether to scale samples to have unit variance during transformation. Default is True.
hyperparameters
parameters
fit(X)

Store the feature-wise mean and standard deviation across the samples in X for future scaling.

Parameters: X (ndarray of shape (N, C)) – An array of N samples, each with dimensionality C.
transform(X)

Standardize features by removing the mean and scaling to unit variance.

For a sample x, the standardized score is calculated as:

\[z = (x - u) / s\]

where u is the mean of the training samples or zero if with_mean is False, and s is the standard deviation of the training samples or 1 if with_std is False.

Parameters: X (ndarray of shape (N, C)) – An array of N samples, each with dimensionality C.
Returns: Z (ndarray of shape (N, C)) – The feature-wise standardized version of X.
inverse_transform(Z)

Convert a collection of standardized features back into the original feature space.

For a standardized sample z, the unstandardized score is calculated as:

\[x = z s + u\]

where u is the mean of the training samples or zero if with_mean is False, and s is the standard deviation of the training samples or 1 if with_std is False.

Parameters: Z (ndarray of shape (N, C)) – An array of N standardized samples, each with dimensionality C.
Returns: X (ndarray of shape (N, C)) – The unstandardized samples from Z.
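
The formulas for transform and inverse_transform can be checked directly with NumPy. The sketch below assumes with_mean=True and with_std=True and computes the statistics inline instead of calling the class; the data values are made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# "fit": feature-wise mean and (biased) standard deviation over the samples.
u = X.mean(axis=0)
s = X.std(axis=0, ddof=0)

# "transform": z = (x - u) / s, applied column by column via broadcasting.
Z = (X - u) / s

# "inverse_transform": x = z * s + u recovers the original features.
X_back = Z * s + u
```

After the transform each column of Z has zero mean and unit standard deviation, and X_back matches X up to floating-point error.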

minibatch

numpy_ml.preprocessing.general.minibatch(X, batchsize=256, shuffle=True)

Compute the minibatch indices for a training dataset.

Parameters:
  • X (ndarray of shape (N, *)) – The dataset to divide into minibatches. Assumes the first dimension represents the number of training examples.
  • batchsize (int) – The desired size of each minibatch. Note, however, that if X.shape[0] % batchsize > 0 then the final batch will contain fewer than batchsize entries. Default is 256.
  • shuffle (bool) – Whether to shuffle the entries in the dataset before dividing into minibatches. Default is True.
Returns:
  • mb_generator (generator) – A generator which yields the indices into X for each batch.
  • n_batches (int) – The number of batches.
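
A generator with this behavior might look like the sketch below. It is an illustration of the documented contract, not the library's source: the function name minibatch_indices and the seed argument are hypothetical additions made so the example is reproducible.

```python
import numpy as np


def minibatch_indices(X, batchsize=256, shuffle=True, seed=None):
    """Hypothetical helper: return (generator of index arrays, number of batches)."""
    N = X.shape[0]
    ix = np.arange(N)
    # Round up so a final, smaller batch is produced when N % batchsize > 0.
    n_batches = int(np.ceil(N / batchsize))
    if shuffle:
        # `seed` is an assumption of this sketch, used only for reproducibility.
        np.random.default_rng(seed).shuffle(ix)

    def mb_generator():
        for i in range(n_batches):
            yield ix[i * batchsize : (i + 1) * batchsize]

    return mb_generator(), n_batches


X = np.arange(20).reshape(10, 2)
batches, n_batches = minibatch_indices(X, batchsize=4, seed=0)
sizes = [len(batch_ix) for batch_ix in batches]
print(n_batches, sizes)  # 3 [4, 4, 2]
```

Each yielded array indexes the first dimension of X, so X[batch_ix] selects the rows belonging to that minibatch.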