Digital signal processing

DCT

numpy_ml.preprocessing.dsp.DCT(frame, orthonormal=True)[source]

A naive \(O(N^2)\) implementation of the 1D discrete cosine transform-II (DCT-II).

Notes

For a signal \(\mathbf{x} = [x_0, \ldots, x_{N-1}]\) consisting of N samples, the k th DCT coefficient, \(c_k\), is

\[c_k = 2 \sum_{n=0}^{N-1} x_n \cos(\pi k (2 n + 1) / (2 N))\]

where k ranges over \(0, \ldots, N-1\).

The DCT is highly similar to the DFT – whereas in a DFT the basis functions are sinusoids, in a DCT they are restricted solely to cosines. A signal’s DCT representation tends to have more of its energy concentrated in a smaller number of coefficients when compared to the DFT, and is thus commonly used for signal compression. [1]

[1] Smoother signals can be accurately approximated using fewer DFT / DCT coefficients, resulting in a higher compression ratio. The DCT naturally yields a continuous extension at the signal boundaries due to its use of even basis functions (cosine). This in turn produces a smoother extension than the periodic extension implied by the DFT, resulting in higher compression.
Parameters:
  • frame (ndarray of shape (N,)) – A signal frame consisting of N samples
  • orthonormal (bool) – Scale to ensure the coefficient vector is orthonormal. Default is True.
Returns:

dct (ndarray of shape (N,)) – The discrete cosine transform of the samples in frame.
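
As a rough illustration of the formula in the Notes above, the sketch below evaluates the DCT-II sum directly with a cosine basis matrix. The name `naive_dct2` is hypothetical and not part of numpy_ml, and the orthonormal scaling shown (\(\sqrt{1/(4N)}\) for k = 0 and \(\sqrt{1/(2N)}\) otherwise) is an assumption based on the usual orthonormal DCT-II convention.

```python
import numpy as np


def naive_dct2(frame, orthonormal=True):
    """Naive O(N^2) DCT-II: c_k = 2 * sum_n x_n cos(pi k (2n + 1) / (2N))."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    n = np.arange(N)   # sample index
    k = n[:, None]     # coefficient index, as a column for broadcasting
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))  # shape (N, N)
    coefs = 2 * basis @ frame
    if orthonormal:
        # Assumed scaling: makes the transform matrix orthonormal
        scale = np.full(N, np.sqrt(1 / (2 * N)))
        scale[0] = np.sqrt(1 / (4 * N))
        coefs = coefs * scale
    return coefs


x = np.array([1.0, 2.0, 3.0, 4.0])
raw = naive_dct2(x, orthonormal=False)
ortho = naive_dct2(x, orthonormal=True)
```

With the orthonormal scaling, the sum of squared coefficients matches the sum of squared samples, which is a convenient sanity check.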

DFT

numpy_ml.preprocessing.dsp.DFT(frame, positive_only=True)[source]

A naive \(O(N^2)\) implementation of the 1D discrete Fourier transform (DFT).

Notes

The Fourier transform decomposes a signal into a linear combination of sinusoids (i.e., basis elements in the space of continuous periodic functions). For a sequence \(\mathbf{x} = [x_0, \ldots, x_{N-1}]\) of N evenly spaced samples, the k th DFT coefficient is given by:

\[c_k = \sum_{n=0}^{N-1} x_n \exp(-2 \pi i k n / N)\]

where i is the imaginary unit, k is an index ranging from 0, …, N-1, and \(c_k\) is the complex coefficient whose magnitude and argument give the amplitude and phase, respectively, of the k th sinusoid in the DFT spectrum. The frequency of the k th sinusoid is \((k 2 \pi / N)\) radians per sample.

When applied to a real-valued input, the negative-frequency terms are the complex conjugates of the positive-frequency terms, so the magnitude spectrum is symmetric (excluding the first index, which contains the zero-frequency / intercept term).

Parameters:
  • frame (ndarray of shape (N,)) – A signal frame consisting of N samples
  • positive_only (bool) – Whether to only return the coefficients for the positive frequency terms. Default is True.
Returns:

spectrum (ndarray of shape (N,) or (N // 2 + 1,) if positive_only) – The coefficients of the frequency spectrum for frame, including imaginary components.
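
The DFT sum in the Notes above can be sketched as a single matrix-vector product. This is a minimal illustration rather than the library's code: `naive_dft` is a hypothetical name, and the results are compared against `np.fft.fft` and `np.fft.rfft` only as a sanity check.

```python
import numpy as np


def naive_dft(frame, positive_only=True):
    """Naive O(N^2) DFT: c_k = sum_n x_n exp(-2 pi i k n / N)."""
    frame = np.asarray(frame)
    N = len(frame)
    n = np.arange(N)   # sample index
    k = n[:, None]     # coefficient index
    spectrum = np.exp(-2j * np.pi * k * n / N) @ frame
    # For real input, bins 0 .. N // 2 carry all of the information
    return spectrum[: N // 2 + 1] if positive_only else spectrum


x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 0.5])
full = naive_dft(x, positive_only=False)
half = naive_dft(x, positive_only=True)
```

For this real-valued `x`, `full[N - k]` is the complex conjugate of `full[k]`, which is why the positive-frequency half is sufficient.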

dft_bins

numpy_ml.preprocessing.dsp.dft_bins(N, fs=44000, positive_only=True)[source]

Calculate the frequency bin centers for a DFT with N coefficients.

Parameters:
  • N (int) – The number of frequency bins in the DFT
  • fs (int) – The sample rate/frequency of the signal (in Hz). Default is 44000.
  • positive_only (bool) – Whether to only return the bins for the positive frequency terms. Default is True.
Returns:

bins (ndarray of shape (N,) or (N // 2 + 1,) if positive_only) – The frequency bin centers associated with each coefficient in the DFT spectrum
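
One common convention places the k th bin center at k * fs / N Hz. The sketch below uses that convention, with `np.fft.fftfreq` supplying the ordering for the full spectrum. Both the convention and the name `bin_centers` are assumptions for illustration, and the library may compute its bins differently.

```python
import numpy as np


def bin_centers(N, fs=44000, positive_only=True):
    """Frequency (in Hz) associated with each DFT coefficient."""
    if positive_only:
        # Bins 0 .. N // 2, from 0 Hz up to (at most) fs / 2
        return np.arange(N // 2 + 1) * fs / N
    # Full spectrum: non-negative frequencies first, then negative ones
    return np.fft.fftfreq(N) * fs


positive = bin_centers(8, fs=8000)
everything = bin_centers(8, fs=8000, positive_only=False)
```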

magnitude_spectrum

numpy_ml.preprocessing.dsp.magnitude_spectrum(frames)[source]

Compute the magnitude spectrum (i.e., absolute value of the DFT spectrum) for each frame in frames. Assumes each frame is real-valued only.

Parameters: frames (ndarray of shape (M, N)) – A sequence of M frames each consisting of N samples
Returns: magnitude_spec (ndarray of shape (M, N // 2 + 1)) – The magnitude spectrum for each frame in frames. Only includes the coefficients for the positive spectrum frequencies.

power_spectrum

numpy_ml.preprocessing.dsp.power_spectrum(frames, scale=False)[source]

Compute the power spectrum for a signal represented as a collection of frames. Assumes each frame is real-valued only.

The power spectrum is simply the square of the magnitude spectrum, possibly scaled by the number of DFT bins. It measures how the energy of the signal is distributed over the frequency domain.

Parameters:
  • frames (ndarray of shape (M, N)) – A sequence of M frames each consisting of N samples
  • scale (bool) – Whether to scale by the number of DFT bins. Default is False.
Returns:

power_spec (ndarray of shape (M, N // 2 + 1)) – The power spectrum for each frame in frames. Only includes the coefficients for the positive spectrum frequencies.
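
The relationship between the magnitude and power spectra can be sketched with NumPy's real FFT. `spectra_sketch` is a hypothetical helper, and dividing by the frame length N when `scale=True` is an assumed reading of "scaled by the number of DFT bins".

```python
import numpy as np


def spectra_sketch(frames, scale=False):
    """Return (magnitude, power) spectra for real-valued frames of shape (M, N)."""
    frames = np.asarray(frames, dtype=float)
    # rfft keeps only the N // 2 + 1 non-negative frequency bins
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    power = magnitude ** 2
    if scale:
        power = power / frames.shape[1]  # assumption: scale by frame length N
    return magnitude, power


frames = np.ones((2, 8))
mag, power = spectra_sketch(frames)
_, scaled_power = spectra_sketch(frames, scale=True)
```

A constant frame puts all of its energy in the zero-frequency bin, so every other column of `mag` and `power` is zero here.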

batch_resample

numpy_ml.preprocessing.dsp.batch_resample(X, new_dim, mode='bilinear')[source]

Resample each image (or similar grid-based 2D signal) in a batch to new_dim using the specified resampling strategy.

Parameters:
  • X (ndarray of shape (n_ex, in_rows, in_cols, in_channels)) – An input image volume
  • new_dim (2-tuple of (out_rows, out_cols)) – The dimension to resample each image to
  • mode ({'bilinear', 'neighbor'}) – The resampling strategy to employ. Default is ‘bilinear’.
Returns:

resampled (ndarray of shape (n_ex, out_rows, out_cols, in_channels)) – The resampled image volume.

nn_interpolate_2D

numpy_ml.preprocessing.dsp.nn_interpolate_2D(X, x, y)[source]

Estimate the pixel values at the coordinates (x, y) in X using a nearest neighbor interpolation strategy.

Notes

Assumes the current entries in X reflect equally-spaced samples from a 2D integer grid.

Parameters:
  • X (ndarray of shape (in_rows, in_cols, in_channels)) – An input image sampled along a grid of in_rows by in_cols.
  • x (list of length k) – A list of x-coordinates for the samples we wish to generate
  • y (list of length k) – A list of y-coordinates for the samples we wish to generate
Returns:

samples (ndarray of shape (k, in_channels)) – The samples for each (x,y) coordinate computed via nearest neighbor interpolation
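
Nearest neighbor interpolation amounts to rounding each query coordinate to the closest grid point and reading off the value stored there. The sketch below assumes x indexes columns and y indexes rows, and clips out-of-range coordinates to the image border. Both choices, and the name `nearest_neighbor_sketch`, are assumptions rather than documented library behavior.

```python
import numpy as np


def nearest_neighbor_sketch(X, x, y):
    """Sample X (in_rows, in_cols, in_channels) at the grid points nearest to (x, y)."""
    rows, cols = X.shape[:2]
    # Round to the nearest integer grid index, then keep it inside the image
    col_idx = np.clip(np.rint(x).astype(int), 0, cols - 1)
    row_idx = np.clip(np.rint(y).astype(int), 0, rows - 1)
    return X[row_idx, col_idx]  # shape (k, in_channels)


X = np.arange(4.0).reshape(2, 2, 1)  # pixel values [[0, 1], [2, 3]]
samples = nearest_neighbor_sketch(X, x=[0.2, 0.9], y=[0.1, 0.8])
```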

nn_interpolate_1D

numpy_ml.preprocessing.dsp.nn_interpolate_1D(X, t)[source]

Estimate the signal values at X[t] using a nearest neighbor interpolation strategy.

Parameters:
  • X (ndarray of shape (in_length, in_channels)) – An input signal sampled along an integer grid of length in_length
  • t (list of length k) – A list of coordinates for the samples we wish to generate
Returns:

samples (ndarray of shape (k, in_channels)) – The samples for each coordinate in t computed via nearest neighbor interpolation

bilinear_interpolate

numpy_ml.preprocessing.dsp.bilinear_interpolate(X, x, y)[source]

Estimate the pixel values at the coordinates (x, y) in X via bilinear interpolation.

Notes

Assumes the current entries in X reflect equally-spaced samples from a 2D integer grid.

Modified from https://bit.ly/2NMb1Dr

Parameters:
  • X (ndarray of shape (in_rows, in_cols, in_channels)) – An input image sampled along a grid of in_rows by in_cols.
  • x (list of length k) – A list of x-coordinates for the samples we wish to generate
  • y (list of length k) – A list of y-coordinates for the samples we wish to generate
Returns:

samples (ndarray of shape (k, in_channels)) – The samples for each (x,y) coordinate computed via bilinear interpolation
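
Bilinear interpolation blends the four grid points surrounding each query coordinate, weighting each by its proximity along both axes. The following is a minimal sketch under the assumptions that x indexes columns, y indexes rows, and out-of-range coordinates are clipped to the image border. `bilinear_sketch` is a hypothetical name and this is not the library's implementation.

```python
import numpy as np


def bilinear_sketch(X, x, y):
    """Bilinearly interpolate X (in_rows, in_cols, in_channels) at points (x, y)."""
    rows, cols = X.shape[:2]
    x = np.clip(np.asarray(x, dtype=float), 0, cols - 1)
    y = np.clip(np.asarray(y, dtype=float), 0, rows - 1)

    # Integer corners surrounding each query point
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, cols - 1)
    y1 = np.minimum(y0 + 1, rows - 1)

    # Fractional offsets from the top-left corner, broadcast over channels
    dx = (x - x0)[:, None]
    dy = (y - y0)[:, None]

    # Interpolate along x on the two bracketing rows, then along y
    top = X[y0, x0] * (1 - dx) + X[y0, x1] * dx
    bottom = X[y1, x0] * (1 - dx) + X[y1, x1] * dx
    return top * (1 - dy) + bottom * dy  # shape (k, in_channels)


X = np.arange(4.0).reshape(2, 2, 1)  # pixel values [[0, 1], [2, 3]]
samples = bilinear_sketch(X, x=[0.5, 1.0], y=[0.5, 0.0])
```

The first query point sits at the center of the four pixels and therefore receives their average; the second lands exactly on a grid point and returns that pixel unchanged.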

to_frames

numpy_ml.preprocessing.dsp.to_frames(x, frame_width, stride, writeable=False)[source]

Convert a 1D signal x into overlapping windows of width frame_width using a hop length of stride.

Notes

If (len(x) - frame_width) % stride != 0 then some of the trailing samples in x will be dropped. Specifically:

n_dropped_samples = len(x) - frame_width - stride * (n_frames - 1)

where:

n_frames = (len(x) - frame_width) // stride + 1

This method uses low-level stride manipulation to avoid creating an additional copy of x. The downside is that if writeable=True, modifying the frame output can result in unexpected behavior:

>>> out = to_frames(np.arange(6), 5, 1, writeable=True)
>>> out
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5]])
>>> out[0, 1] = 99
>>> out
array([[ 0, 99,  2,  3,  4],
       [99,  2,  3,  4,  5]])
Parameters:
  • x (ndarray of shape (N,)) – A 1D signal consisting of N samples
  • frame_width (int) – The width of a single frame window in samples
  • stride (int) – The hop size / number of samples advanced between consecutive frames
  • writeable (bool) – If set to False, the returned array will be readonly. Otherwise it will be writable if x was. It is advisable to set this to False whenever possible to avoid unexpected behavior (see the Notes above). Default is False.
Returns:

frame (ndarray of shape (n_frames, frame_width)) – The collection of overlapping frames stacked into a matrix
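
The framing arithmetic and the stride-based view described in the Notes can be sketched with `numpy.lib.stride_tricks.as_strided`. `frames_view` is a hypothetical stand-in, written under the assumption that x is a 1D array, and is not the library's implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def frames_view(x, frame_width, stride, writeable=False):
    """Return overlapping frames of x as a strided view (no copy of the samples)."""
    x = np.ascontiguousarray(x)
    n_frames = (len(x) - frame_width) // stride + 1
    step = x.strides[0]  # bytes between consecutive samples
    return as_strided(
        x,
        shape=(n_frames, frame_width),
        strides=(stride * step, step),  # hop `stride` samples between rows
        writeable=writeable,
    )


x = np.arange(11)
out = frames_view(x, frame_width=4, stride=3)
n_dropped = len(x) - 4 - 3 * (out.shape[0] - 1)  # trailing samples left unused
```

With 11 samples, a frame width of 4 and a stride of 3, three frames fit and the final sample is dropped.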

autocorrelate1D

numpy_ml.preprocessing.dsp.autocorrelate1D(x)[source]

Autocorrelate a 1D signal x with itself.

Notes

The k th term in the one-dimensional autocorrelation is

\[a_k = \sum_n x_{n + k} x_n\]

NB. This is a naive \(O(N^2)\) implementation. For a faster \(O(N \log N)\) approach using the FFT, see [1].

References

[1] https://en.wikipedia.org/wiki/Autocorrelation#Efficient_computation
Parameters: x (ndarray of shape (N,)) – A 1D signal consisting of N samples
Returns: auto (ndarray of shape (N,)) – The autocorrelation of x with itself
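
To make the contrast between the naive sum and the FFT route concrete, the sketch below implements both for lags k = 0, …, N-1. The function names are hypothetical. The FFT variant zero-pads to length 2N so that the circular correlation it computes matches the linear one.

```python
import numpy as np


def autocorr_naive(x):
    """O(N^2) autocorrelation: a_k = sum_n x[n + k] * x[n] for k = 0 .. N-1."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([np.dot(x[k:], x[: N - k]) for k in range(N)])


def autocorr_fft(x):
    """O(N log N) autocorrelation via the power spectrum of the zero-padded signal."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    spectrum = np.fft.rfft(x, n=2 * N)  # zero-pad to avoid wrap-around
    return np.fft.irfft(np.abs(spectrum) ** 2, n=2 * N)[:N]


x = np.array([1.0, 2.0, 3.0])
slow = autocorr_naive(x)
fast = autocorr_fft(x)
```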

preemphasis

numpy_ml.preprocessing.dsp.preemphasis(x, alpha)[source]

Increase the amplitude of high frequency bands and decrease the amplitude of lower bands.

Notes

Preemphasis filtering is (was?) a common transform in speech processing, where higher frequencies tend to be more useful during signal disambiguation.

\[\text{preemphasis}( x_t ) = x_t - \alpha x_{t-1}\]
Parameters:
  • x (ndarray of shape (N,)) – A 1D signal consisting of N samples
  • alpha (float in [0, 1)) – The preemphasis coefficient. A value of 0 corresponds to no filtering
Returns:

out (ndarray of shape (N,)) – The filtered signal
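
The filter equation above is a first-order difference and can be sketched in one line. Passing the first sample through unchanged (since it has no predecessor) is an assumption, and `preemphasis_sketch` is a hypothetical name.

```python
import numpy as np


def preemphasis_sketch(x, alpha):
    """Apply y[t] = x[t] - alpha * x[t - 1], leaving the first sample unchanged."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x[:1], x[1:] - alpha * x[:-1]])


x = np.ones(4)
filtered = preemphasis_sketch(x, alpha=0.5)
unfiltered = preemphasis_sketch(x, alpha=0.0)
```

A constant (zero-frequency) signal is attenuated after the first sample, while `alpha=0` returns the input as-is.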

cepstral_lifter

numpy_ml.preprocessing.dsp.cepstral_lifter(mfccs, D)[source]

A simple sinusoidal filter applied in the Mel-frequency domain.

Notes

Cepstral lifting helps to smooth the spectral envelope and dampen the magnitude of the higher MFCC coefficients while keeping the other coefficients unchanged. The filter function is:

\[\text{lifter}( x_n ) = x_n \left(1 + \frac{D \sin(\pi n / D)}{2}\right)\]
Parameters:
  • mfccs (ndarray of shape (G, C)) – Matrix of Mel cepstral coefficients. Rows correspond to frames, columns to cepstral coefficients
  • D (int in \([0, +\infty)\)) – The filter coefficient. 0 corresponds to no filtering, larger values correspond to greater amounts of smoothing
Returns:

out (ndarray of shape (G, C)) – The lifter’d MFCC coefficients
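
The lifter function above multiplies the n th cepstral coefficient by a weight that depends only on n and D. The sketch below assumes n counts columns starting from 0 and treats D = 0 as a pass-through, per the parameter description. `lifter_sketch` is a hypothetical name.

```python
import numpy as np


def lifter_sketch(mfccs, D):
    """Scale column n of mfccs (G, C) by 1 + (D / 2) * sin(pi * n / D)."""
    mfccs = np.asarray(mfccs, dtype=float)
    if D == 0:
        return mfccs  # no filtering
    n = np.arange(mfccs.shape[1])  # assumed: coefficient index starts at 0
    weights = 1 + (D / 2) * np.sin(np.pi * n / D)
    return mfccs * weights


mfccs = np.ones((2, 4))
liftered = lifter_sketch(mfccs, D=2)
```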

mel_spectrogram

numpy_ml.preprocessing.dsp.mel_spectrogram(x, window_duration=0.025, stride_duration=0.01, mean_normalize=True, window='hamming', n_filters=20, center=True, alpha=0.95, fs=44000)[source]

Apply the Mel-filterbank to the power spectrum for a signal x.

Notes

The Mel spectrogram is the projection of the power spectrum of the framed and windowed signal onto the basis set provided by the Mel filterbank.

Parameters:
  • x (ndarray of shape (N,)) – A 1D signal consisting of N samples
  • window_duration (float) – The duration of each frame / window (in seconds). Default is 0.025.
  • stride_duration (float) – The duration of the hop between consecutive windows (in seconds). Default is 0.01.
  • mean_normalize (bool) – Whether to subtract the coefficient means from the final filter values to improve the signal-to-noise ratio. Default is True.
  • window ({'hamming', 'hann', 'blackman_harris'}) – The windowing function to apply to the signal before FFT. Default is ‘hamming’.
  • n_filters (int) – The number of mel filters to include in the filterbank. Default is 20.
  • center (bool) – Whether the k th frame of the signal should begin at index x[k * stride_len] (center = False) or be centered at x[k * stride_len] (center = True). Default is True.
  • alpha (float in [0, 1)) – The coefficient for the preemphasis filter. A value of 0 corresponds to no filtering. Default is 0.95.
  • fs (int) – The sample rate/frequency for the signal. Default is 44000.
Returns:

  • filter_energies (ndarray of shape (G, n_filters)) – The (possibly mean_normalized) power for each filter in the Mel filterbank (i.e., the Mel spectrogram). Rows correspond to frames, columns to filters
  • energy_per_frame (ndarray of shape (G,)) – The total energy in each frame of the signal

mfcc

numpy_ml.preprocessing.dsp.mfcc(x, fs=44000, n_mfccs=13, alpha=0.95, center=True, n_filters=20, window='hann', normalize=True, lifter_coef=22, stride_duration=0.01, window_duration=0.025, replace_intercept=True)[source]

Compute the Mel-frequency cepstral coefficients (MFCC) for a signal.

Notes

Computing MFCC features proceeds in the following stages:

  1. Convert the signal into overlapping frames and apply a window function
  2. Compute the power spectrum at each frame
  3. Apply the mel filterbank to the power spectra to get mel filterbank powers
  4. Take the logarithm of the mel filterbank powers at each frame
  5. Take the discrete cosine transform (DCT) of the log filterbank energies and retain only the first k coefficients to further reduce the dimensionality

MFCCs were developed in the context of HMM-GMM automatic speech recognition (ASR) systems and can be used to provide a somewhat speaker/pitch invariant representation of phonemes.

Parameters:
  • x (ndarray of shape (N,)) – A 1D signal consisting of N samples
  • fs (int) – The sample rate/frequency for the signal. Default is 44000.
  • n_mfccs (int) – The number of cepstral coefficients to return (including the intercept coefficient). Default is 13.
  • alpha (float in [0, 1)) – The preemphasis coefficient. A value of 0 corresponds to no filtering. Default is 0.95.
  • center (bool) – Whether the k th frame of the signal should begin at index x[k * stride_len] (center = False) or be centered at x[k * stride_len] (center = True). Default is True.
  • n_filters (int) – The number of filters to include in the Mel filterbank. Default is 20.
  • normalize (bool) – Whether to mean-normalize the MFCC values. Default is True.
  • lifter_coef (int in \([0, +\infty)\)) – The cepstral filter coefficient. 0 corresponds to no filtering, larger values correspond to greater amounts of smoothing. Default is 22.
  • window ({'hamming', 'hann', 'blackman_harris'}) – The windowing function to apply to the signal before taking the DFT. Default is ‘hann’.
  • stride_duration (float) – The duration of the hop between consecutive windows (in seconds). Default is 0.01.
  • window_duration (float) – The duration of each frame / window (in seconds). Default is 0.025.
  • replace_intercept (bool) – Replace the first MFCC coefficient (the intercept term) with the log of the total frame energy instead. Default is True.
Returns:

mfccs (ndarray of shape (G, C)) – Matrix of Mel-frequency cepstral coefficients. Rows correspond to frames, columns to cepstral coefficients
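
The five stages listed in the Notes can be sketched end to end in a few lines. This is a simplified illustration, not the library's implementation: it omits preemphasis, centering, mean normalization, liftering, and intercept replacement. The function `mfcc_sketch`, its signature, the small constant added before the logarithm, and the random stand-in filterbank are all assumptions made for the example.

```python
import numpy as np


def mfcc_sketch(x, fbank, frame_width, stride, n_mfccs=13):
    """Simplified MFCC pipeline; fbank has shape (n_filters, frame_width // 2 + 1)."""
    x = np.asarray(x, dtype=float)

    # 1. Overlapping frames, each multiplied by a Hann window
    n_frames = (len(x) - frame_width) // stride + 1
    idx = np.arange(frame_width) + stride * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(frame_width)

    # 2. Power spectrum of each frame (positive frequencies only)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # 3. Project onto the mel filterbank
    energies = power @ fbank.T

    # 4. Log of the filterbank powers (small offset avoids log(0))
    log_energies = np.log(energies + 1e-10)

    # 5. DCT-II across filters, keeping the first n_mfccs coefficients
    n_filters = fbank.shape[0]
    n = np.arange(n_filters)
    k = np.arange(n_mfccs)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return 2 * log_energies @ basis.T  # shape (n_frames, n_mfccs)


rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
stand_in_fbank = rng.random((20, 129))  # placeholder for a real mel filterbank
coefs = mfcc_sketch(signal, stand_in_fbank, frame_width=256, stride=128)
```

In practice the filterbank would be a triangular mel filterbank rather than random weights; the random matrix here only exercises the shapes.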

mel2hz

numpy_ml.preprocessing.dsp.mel2hz(mel, formula='htk')[source]

Convert the mel-scale representation of a signal into Hz

Parameters:
  • mel (ndarray of shape (N, *)) – An array of mel frequencies to convert
  • formula ({"htk", "slaney"}) – The Mel formula to use. “htk” uses the formula used by the Hidden Markov Model Toolkit, and described in O’Shaughnessy (1987). “slaney” uses the formula used in the MATLAB auditory toolbox (Slaney, 1998). Default is ‘htk’
Returns:

hz (ndarray of shape (N, *)) – The frequencies of the items in mel, in Hz

hz2mel

numpy_ml.preprocessing.dsp.hz2mel(hz, formula='htk')[source]

Convert the frequency representation of a signal in Hz into the mel scale.

Parameters:
  • hz (ndarray of shape (N, *)) – An array of frequencies (in Hz) to convert
  • formula ({"htk", "slaney"}) – The Mel formula to use. “htk” uses the formula used by the Hidden Markov Model Toolkit, and described in O’Shaughnessy (1987). “slaney” uses the formula used in the MATLAB auditory toolbox (Slaney, 1998). Default is ‘htk’.
Returns:

mel (ndarray of shape (N, *)) – The mel-scale values of the items in hz.
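
For reference, the HTK-style conversion is commonly written as mel = 2595 log10(1 + hz / 700), with the inverse obtained by solving for hz. The sketch below covers only that formula (the Slaney variant is not shown), and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel_htk(hz):
    """HTK-style Hz -> mel conversion."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz_htk(mel):
    """HTK-style mel -> Hz conversion (inverse of hz_to_mel_htk)."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


hz = np.array([0.0, 440.0, 1000.0, 8000.0])
mel = hz_to_mel_htk(hz)
round_trip = mel_to_hz_htk(mel)
```

Under this formula 1000 Hz maps to approximately 1000 mel, and equal steps in Hz correspond to progressively smaller steps in mel as frequency increases.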

mel_filterbank

numpy_ml.preprocessing.dsp.mel_filterbank(N, n_filters=20, fs=44000, min_freq=0, max_freq=None, normalize=True)[source]

Compute the filters in a Mel filterbank and return the corresponding transformation matrix

Notes

The Mel scale is a perceptual scale designed to simulate the way the human ear works. Pitches judged by listeners to be equal in perceptual / psychological distance have equal distance on the Mel scale. Practically, this corresponds to a scale with higher resolution at low frequencies and lower resolution at higher (> 500 Hz) frequencies.

Each filter in the Mel filterbank is triangular with a response of 1 at its center and a linear decay on both sides until it reaches the center frequency of the next adjacent filter.

This implementation is based on code in the (superb) LibROSA package [1].

References

[1] McFee et al. (2015). “librosa: Audio and music signal analysis in Python”, Proceedings of the 14th Python in Science Conference https://librosa.github.io
Parameters:
  • N (int) – The number of DFT bins
  • n_filters (int) – The number of mel filters to include in the filterbank. Default is 20.
  • min_freq (int) – Minimum filter frequency (in Hz). Default is 0.
  • max_freq (int or None) – Maximum filter frequency (in Hz). Default is None.
  • fs (int) – The sample rate/frequency for the signal. Default is 44000.
  • normalize (bool) – If True, scale the Mel filter weights by their area in Mel space. Default is True.
Returns:

fbank (ndarray of shape (n_filters, N // 2 + 1)) – The mel-filterbank transformation matrix. Rows correspond to filters, columns to DFT bins.
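
The triangular filters described in the Notes can be sketched by spacing n_filters + 2 edge frequencies evenly on the mel scale and building one triangle per consecutive triple of edges. The sketch makes several assumptions: it uses the HTK mel formula, treats `max_freq=None` as fs / 2, places DFT bin k at k * fs / N Hz, and leaves each filter with a peak response of 1 (no area normalization). `mel_filterbank_sketch` is a hypothetical name and this is not the library's or LibROSA's implementation.

```python
import numpy as np


def mel_filterbank_sketch(N, n_filters=20, fs=44000, min_freq=0, max_freq=None):
    """Triangular mel filters as a matrix of shape (n_filters, N // 2 + 1)."""
    max_freq = fs / 2 if max_freq is None else max_freq  # assumed default

    def hz2mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel2hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    # n_filters + 2 edges, evenly spaced in mel, converted back to Hz
    mel_edges = np.linspace(hz2mel(min_freq), hz2mel(max_freq), n_filters + 2)
    edges = mel2hz(mel_edges)
    bins = np.arange(N // 2 + 1) * fs / N  # DFT bin centers in Hz

    fbank = np.zeros((n_filters, N // 2 + 1))
    for i in range(n_filters):
        left, center, right = edges[i : i + 3]
        rising = (bins - left) / (center - left)      # 0 at left edge, 1 at center
        falling = (right - bins) / (right - center)   # 1 at center, 0 at right edge
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank


fbank = mel_filterbank_sketch(N=512)
```

Because the edges are evenly spaced in mel rather than Hz, the low-frequency filters are narrow and the high-frequency filters are wide.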