Digital signal processing
DCT
numpy_ml.preprocessing.dsp.DCT(frame, orthonormal=True)
A naive \(O(N^2)\) implementation of the 1D discrete cosine transform-II (DCT-II).
Notes
For a signal \(\mathbf{x} = [x_0, \ldots, x_{N-1}]\) consisting of N samples, the kth DCT coefficient, \(c_k\), is
\[c_k = 2 \sum_{n=0}^{N-1} x_n \cos(\pi k (2 n + 1) / (2 N))\]
where k ranges over \(0, \ldots, N-1\).
The DCT is highly similar to the DFT – whereas in a DFT the basis functions are sinusoids, in a DCT they are restricted solely to cosines. A signal’s DCT representation tends to have more of its energy concentrated in a smaller number of coefficients when compared to the DFT, and is thus commonly used for signal compression. [1]
[1] Smoother signals can be accurately approximated using fewer DFT / DCT coefficients, resulting in a higher compression ratio. The DCT naturally yields a continuous extension at the signal boundaries due to its use of even basis functions (cosines). This in turn produces a smoother extension than the periodic extension implied by the DFT, resulting in higher compression.
Returns: dct (ndarray of shape (N,)) – The discrete cosine transform of the samples in frame.
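The DCT-II sum in the Notes can be written out directly. The following is a minimal sketch in plain Python (not the library's code); the name `naive_dct2` is hypothetical, and the `orthonormal` rescaling is deliberately left out because the Notes only give the unnormalized formula.

```python
import math


def naive_dct2(frame):
    """Unnormalized DCT-II: c_k = 2 * sum_n x_n * cos(pi * k * (2n + 1) / (2N))."""
    N = len(frame)
    coeffs = []
    for k in range(N):
        total = 0.0
        for n, x_n in enumerate(frame):
            total += x_n * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
        coeffs.append(2.0 * total)
    return coeffs


# A constant signal puts all of its energy in the k = 0 coefficient.
flat = naive_dct2([1.0, 1.0, 1.0, 1.0])
```

The double loop makes the \(O(N^2)\) cost explicit: each of the N coefficients visits all N samples.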
DFT
numpy_ml.preprocessing.dsp.DFT(frame, positive_only=True)
A naive \(O(N^2)\) implementation of the 1D discrete Fourier transform (DFT).
Notes
The Fourier transform decomposes a signal into a linear combination of sinusoids (i.e., basis elements in the space of continuous periodic functions). For a sequence \(\mathbf{x} = [x_0, \ldots, x_{N-1}]\) of N evenly spaced samples, the kth DFT coefficient is given by:
\[c_k = \sum_{n=0}^{N-1} x_n \exp(-2 \pi i k n / N)\]
where i is the imaginary unit, k is an index ranging over 0, …, N-1, and \(c_k\) is the complex coefficient whose modulus and argument give the amplitude and phase, respectively, of the kth sinusoid in the DFT spectrum. The frequency of the kth sinusoid is \(2 \pi k / N\) radians per sample.
When applied to a real-valued input, the negative-frequency terms are the complex conjugates of the positive-frequency terms, so the magnitude spectrum is symmetric (excluding the first index, which contains the zero-frequency / intercept term).
Returns: spectrum (ndarray of shape (N,) or (N // 2 + 1,) if positive_only) – The coefficients of the frequency spectrum for frame, including imaginary components.
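To make the DFT sum and the conjugate symmetry concrete, here is a small sketch using only the standard library. The name `naive_dft` is hypothetical and this is not the library's implementation; it simply evaluates the formula from the Notes.

```python
import cmath


def naive_dft(frame):
    """Evaluate c_k = sum_n x_n * exp(-2*pi*i*k*n / N) for k = 0, ..., N-1."""
    N = len(frame)
    return [
        sum(x_n * cmath.exp(-2j * cmath.pi * k * n / N) for n, x_n in enumerate(frame))
        for k in range(N)
    ]


spectrum = naive_dft([1.0, 2.0, 0.0, -1.0])

# For real input, coefficient N - k is the complex conjugate of coefficient k,
# so the first N // 2 + 1 coefficients carry all of the information.
positive_half = spectrum[: len(spectrum) // 2 + 1]
```

For this four-sample input the coefficients are 2, 1 - 3i, 0 and 1 + 3i, so the last coefficient mirrors the second one.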
dft_bins
numpy_ml.preprocessing.dsp.dft_bins(N, fs=44000, positive_only=True)
Calculate the frequency bin centers for a DFT with N coefficients.
Returns: bins (ndarray of shape (N,) or (N // 2 + 1,) if positive_only) – The frequency bin centers associated with each coefficient in the DFT spectrum.
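As an illustration of what "bin centers" means, the sketch below assumes the usual convention that coefficient k corresponds to the frequency k * fs / N, with the upper half of the indices wrapping around to negative frequencies. The function name `bin_centers` and the exact ordering of the negative bins are assumptions, not taken from the library.

```python
def bin_centers(N, fs=44000, positive_only=True):
    """Frequency in Hz of each DFT coefficient, assuming bin k sits at k * fs / N."""
    if positive_only:
        # Zero frequency up to (and including) the Nyquist bin.
        return [k * fs / N for k in range(N // 2 + 1)]
    # Assumed ordering: non-negative frequencies first, then negative ones.
    n_nonneg = (N + 1) // 2
    return [(k if k < n_nonneg else k - N) * fs / N for k in range(N)]


positive = bin_centers(8, fs=8000)
full = bin_centers(8, fs=8000, positive_only=False)
```

With N = 8 and fs = 8000 the bins are spaced 1000 Hz apart, and the positive-only list ends at the Nyquist frequency of 4000 Hz.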
magnitude_spectrum
numpy_ml.preprocessing.dsp.magnitude_spectrum(frames)
Compute the magnitude spectrum (i.e., absolute value of the DFT spectrum) for each frame in frames. Assumes each frame is real-valued only.
Parameters: frames (ndarray of shape (M, N)) – A sequence of M frames each consisting of N samples
Returns: magnitude_spec (ndarray of shape (M, N // 2 + 1)) – The magnitude spectrum for each frame in frames. Only includes the coefficients for the positive spectrum frequencies.
power_spectrum
numpy_ml.preprocessing.dsp.power_spectrum(frames, scale=False)
Compute the power spectrum for a signal represented as a collection of frames. Assumes each frame is real-valued only.
The power spectrum is simply the square of the magnitude spectrum, possibly scaled by the number of FFT bins. It measures how the energy of the signal is distributed over the frequency domain.
Returns: power_spec (ndarray of shape (M, N // 2 + 1)) – The power spectrum for each frame in frames. Only includes the coefficients for the positive spectrum frequencies.
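The relationship between the magnitude and power spectra can be sketched as follows. This is a plain-Python illustration, not the library's code: `power_spectrum_sketch` is a hypothetical name, and dividing by the number of retained bins (N // 2 + 1) when `scale=True` is an assumption about what "scaled by the number of FFT bins" means.

```python
import cmath


def power_spectrum_sketch(frames, scale=False):
    """Squared DFT magnitudes for the non-negative frequencies of each frame."""
    out = []
    for frame in frames:
        N = len(frame)
        n_bins = N // 2 + 1
        row = []
        for k in range(n_bins):
            c_k = sum(
                x * cmath.exp(-2j * cmath.pi * k * n / N) for n, x in enumerate(frame)
            )
            power = abs(c_k) ** 2  # magnitude spectrum, squared
            # Assumption: scaling divides by the number of retained bins.
            row.append(power / n_bins if scale else power)
        out.append(row)
    return out


frames = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
power = power_spectrum_sketch(frames)
```

The impulse in the first frame spreads its energy evenly over all bins, while the constant second frame concentrates it in the zero-frequency bin.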
batch_resample
numpy_ml.preprocessing.dsp.batch_resample(X, new_dim, mode='bilinear')
Resample each image (or similar grid-based 2D signal) in a batch to new_dim using the specified resampling strategy.
Parameters:
- X (ndarray of shape (n_ex, in_rows, in_cols, in_channels)) – An input image volume
- new_dim (2-tuple of (out_rows, out_cols)) – The dimension to resample each image to
- mode ({'bilinear', 'neighbor'}) – The resampling strategy to employ. Default is 'bilinear'.
Returns: resampled (ndarray of shape (n_ex, out_rows, out_cols, in_channels)) – The resampled image volume.
nn_interpolate_2D
numpy_ml.preprocessing.dsp.nn_interpolate_2D(X, x, y)
Estimate the pixel values at the coordinates (x, y) in X using a nearest neighbor interpolation strategy.
Notes
Assumes the current entries in X reflect equally-spaced samples from a 2D integer grid.
Parameters:
- X (ndarray of shape (in_rows, in_cols, in_channels)) – An input image sampled along a grid of in_rows by in_cols.
- x (list of length k) – A list of x-coordinates for the samples we wish to generate
- y (list of length k) – A list of y-coordinates for the samples we wish to generate
Returns: samples (ndarray of shape (k, in_channels)) – The samples for each (x,y) coordinate computed via nearest neighbor interpolation
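A nearest neighbor lookup amounts to rounding each coordinate to the closest grid point. The sketch below works on nested lists and makes two assumptions that the text does not state: x indexes columns and y indexes rows, and coordinates outside the image are clipped to the border. The name `nearest_neighbor_2d` is hypothetical.

```python
def nearest_neighbor_2d(X, xs, ys):
    """Sample image X (rows x cols x channels) at the grid points nearest (x, y)."""
    n_rows, n_cols = len(X), len(X[0])
    samples = []
    for x, y in zip(xs, ys):
        # Assumption: x is a column coordinate, y is a row coordinate,
        # and out-of-range coordinates are clipped to the image border.
        col = min(max(int(round(x)), 0), n_cols - 1)
        row = min(max(int(round(y)), 0), n_rows - 1)
        samples.append(list(X[row][col]))
    return samples


# A 2 x 2 single-channel image.
image = [[[0], [1]], [[2], [3]]]
picked = nearest_neighbor_2d(image, xs=[0.2, 0.9, 5.0], ys=[0.1, 0.8, -1.0])
```

Each output row holds the channel values of the pixel that was selected, giving k rows of in_channels values.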
nn_interpolate_1D
numpy_ml.preprocessing.dsp.nn_interpolate_1D(X, t)
Estimate the signal values at X[t] using a nearest neighbor interpolation strategy.
Parameters:
- X (ndarray of shape (in_length, in_channels)) – An input signal sampled along an integer grid of length in_length
- t (list of length k) – A list of coordinates for the samples we wish to generate
Returns: samples (ndarray of shape (k, in_channels)) – The samples for each coordinate in t computed via nearest neighbor interpolation
bilinear_interpolate
numpy_ml.preprocessing.dsp.bilinear_interpolate(X, x, y)
Estimate the pixel values at the coordinates (x, y) in X via bilinear interpolation.
Notes
Assumes the current entries in X reflect equally-spaced samples from a 2D integer grid.
Modified from https://bit.ly/2NMb1Dr
Parameters:
- X (ndarray of shape (in_rows, in_cols, in_channels)) – An input image sampled along a grid of in_rows by in_cols.
- x (list of length k) – A list of x-coordinates for the samples we wish to generate
- y (list of length k) – A list of y-coordinates for the samples we wish to generate
Returns: samples (list of length (k, in_channels)) – The samples for each (x,y) coordinate computed via bilinear interpolation
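Bilinear interpolation blends the four grid points surrounding a coordinate, weighting each by how close it is. The following sketch handles a single coordinate on a nested-list image; `bilinear_sample` is a hypothetical name, and treating x as the column coordinate, y as the row coordinate, and clamping to the image border are assumptions.

```python
import math


def bilinear_sample(X, x, y):
    """Bilinearly interpolate image X (rows x cols x channels) at column x, row y."""
    n_rows, n_cols = len(X), len(X[0])
    # Assumption: clamp so the 2 x 2 neighborhood stays inside the image.
    x = min(max(x, 0.0), n_cols - 1.0)
    y = min(max(y, 0.0), n_rows - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, n_cols - 1), min(y0 + 1, n_rows - 1)
    wx, wy = x - x0, y - y0  # fractional distance from the top-left neighbor
    return [
        (1 - wy) * ((1 - wx) * X[y0][x0][c] + wx * X[y0][x1][c])
        + wy * ((1 - wx) * X[y1][x0][c] + wx * X[y1][x1][c])
        for c in range(len(X[0][0]))
    ]


image = [[[0.0], [10.0]], [[20.0], [30.0]]]
center = bilinear_sample(image, 0.5, 0.5)
```

At the exact center of the 2 x 2 image all four pixels contribute equally, and at integer coordinates the result reduces to the pixel value itself.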
to_frames
numpy_ml.preprocessing.dsp.to_frames(x, frame_width, stride, writeable=False)
Convert a 1D signal x into overlapping windows of width frame_width using a hop length of stride.
Notes
If (len(x) - frame_width) % stride != 0 then some number of the samples in x will be dropped. Specifically:
n_dropped_frames = len(x) - frame_width - stride * (n_frames - 1)
where:
n_frames = (len(x) - frame_width) // stride + 1
This method uses low-level stride manipulation to avoid creating an additional copy of x. The downside is that if writeable=True, modifying the frame output can result in unexpected behavior:
>>> out = to_frames(np.arange(6), 5, 1, writeable=True)
>>> out
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5]])
>>> out[0, 1] = 99
>>> out
array([[ 0, 99,  2,  3,  4],
       [99,  2,  3,  4,  5]])
Parameters:
- x (ndarray of shape (N,)) – A 1D signal consisting of N samples
- frame_width (int) – The width of a single frame window in samples
- stride (int) – The hop size / number of samples advanced between consecutive frames
- writeable (bool) – If set to False, the returned array will be readonly. Otherwise it will be writable if x was. It is advisable to set this to False whenever possible to avoid unexpected behavior (see the Notes above). Default is False.
Returns: frame (ndarray of shape (n_frames, frame_width)) – The collection of overlapping frames stacked into a matrix
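The frame-count arithmetic from the Notes can be checked with a copying version of the framing step. This sketch slices a Python list instead of manipulating strides, so it does not share memory with the input; the name `frame_signal` is hypothetical and it assumes len(x) >= frame_width.

```python
def frame_signal(x, frame_width, stride):
    """Copying framing: returns (frames, number of trailing samples dropped)."""
    n_frames = (len(x) - frame_width) // stride + 1
    frames = [x[i * stride : i * stride + frame_width] for i in range(n_frames)]
    n_dropped = len(x) - frame_width - stride * (n_frames - 1)
    return frames, n_dropped


frames, dropped = frame_signal(list(range(11)), frame_width=4, stride=3)
```

Here eleven samples yield three frames starting at indices 0, 3 and 6, and the final sample is dropped because it cannot start or complete another full frame.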
autocorrelate1D
numpy_ml.preprocessing.dsp.autocorrelate1D(x)
Autocorrelate a 1D signal x with itself.
Notes
The kth term in the 1-dimensional autocorrelation is
\[a_k = \sum_n x_{n + k} x_n\]
NB. This is a naive \(O(N^2)\) implementation. For a faster \(O(N \log N)\) approach using the FFT, see [1].
References
[1] https://en.wikipedia.org/wiki/Autocorrelation#Efficient_computation
Parameters: x (ndarray of shape (N,)) – A 1D signal consisting of N samples
Returns: auto (ndarray of shape (N,)) – The autocorrelation of x with itself
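The autocorrelation sum can be sketched in a single expression. The version below assumes lags k = 0, ..., N-1 and treats samples past the end of the signal as zero, which is why the inner sum stops at N - k; the name `autocorrelate` is hypothetical.

```python
def autocorrelate(x):
    """a_k = sum_n x[n + k] * x[n] for lags k = 0, ..., N-1."""
    N = len(x)
    # Assumption: terms with n + k >= N are treated as zero and skipped.
    return [sum(x[n + k] * x[n] for n in range(N - k)) for k in range(N)]


auto = autocorrelate([1, 2, 3])
```

The zero-lag term is the signal's energy (1 + 4 + 9 = 14), and each larger lag overlaps fewer samples.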
preemphasis
numpy_ml.preprocessing.dsp.preemphasis(x, alpha)
Increase the amplitude of high frequency bands and decrease the amplitude of lower bands.
Notes
Preemphasis filtering is (was?) a common transform in speech processing, where higher frequencies tend to be more useful during signal disambiguation.
\[\text{preemphasis}( x_t ) = x_t - \alpha x_{t-1}\]
Parameters:
- x (ndarray of shape (N,)) – A 1D signal consisting of N samples
- alpha (float in [0, 1)) – The preemphasis coefficient. A value of 0 corresponds to no filtering
Returns: out (ndarray of shape (N,)) – The filtered signal
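A short sketch shows why this difference filter favors high frequencies. It assumes the first sample is passed through unchanged, since the formula has no \(x_{-1}\) to subtract; the name `preemphasize` is hypothetical.

```python
def preemphasize(x, alpha):
    """Apply y_t = x_t - alpha * x_{t-1}; the first sample is left as-is."""
    # Assumption: with no predecessor available, y_0 = x_0.
    return list(x[:1]) + [x[t] - alpha * x[t - 1] for t in range(1, len(x))]


slow = preemphasize([1.0, 1.0, 1.0, 1.0], alpha=0.95)    # constant (low frequency)
fast = preemphasize([1.0, -1.0, 1.0, -1.0], alpha=0.95)  # alternating (high frequency)
```

After the first sample, the constant signal shrinks to about 0.05 while the alternating signal grows to about 1.95 in magnitude.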
cepstral_lifter
numpy_ml.preprocessing.dsp.cepstral_lifter(mfccs, D)
A simple sinusoidal filter applied in the Mel-frequency domain.
Notes
Cepstral lifting helps to smooth the spectral envelope and dampen the magnitude of the higher MFCC coefficients while keeping the other coefficients unchanged. The filter function is:
\[\text{lifter}( x_n ) = x_n \left(1 + \frac{D \sin(\pi n / D)}{2}\right)\]
Parameters:
- mfccs (ndarray of shape (G, C)) – Matrix of Mel cepstral coefficients. Rows correspond to frames, columns to cepstral coefficients
- D (int in \([0, +\infty]\)) – The filter coefficient. 0 corresponds to no filtering, larger values correspond to greater amounts of smoothing
Returns: out (ndarray of shape (G, C)) – The lifter'd MFCC coefficients
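The lifter weights can be sketched for a single frame of coefficients. This assumes the coefficient index n starts at 0 and that D = 0 returns the input unchanged, matching the "no filtering" description; the name `lifter_row` is hypothetical.

```python
import math


def lifter_row(mfcc_row, D):
    """Scale the nth coefficient by 1 + (D / 2) * sin(pi * n / D)."""
    if D == 0:
        return list(mfcc_row)  # documented as "no filtering"
    # Assumption: n counts coefficients from 0.
    return [
        c * (1 + (D / 2) * math.sin(math.pi * n / D)) for n, c in enumerate(mfcc_row)
    ]


# With all-ones input the output is just the lifter weights themselves.
weights = lifter_row([1.0] * 13, D=22)
```

With D = 22 the weight is 1 at n = 0 and rises to its maximum of 12 at n = 11, where the sine term reaches 1.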
mel_spectrogram
numpy_ml.preprocessing.dsp.mel_spectrogram(x, window_duration=0.025, stride_duration=0.01, mean_normalize=True, window='hamming', n_filters=20, center=True, alpha=0.95, fs=44000)
Apply the Mel-filterbank to the power spectrum for a signal x.
Notes
The Mel spectrogram is the projection of the power spectrum of the framed and windowed signal onto the basis set provided by the Mel filterbank.
Parameters:
- x (ndarray of shape (N,)) – A 1D signal consisting of N samples
- window_duration (float) – The duration of each frame / window (in seconds). Default is 0.025.
- stride_duration (float) – The duration of the hop between consecutive windows (in seconds). Default is 0.01.
- mean_normalize (bool) – Whether to subtract the coefficient means from the final filter values to improve the signal-to-noise ratio. Default is True.
- window ({'hamming', 'hann', 'blackman_harris'}) – The windowing function to apply to the signal before FFT. Default is 'hamming'.
- n_filters (int) – The number of mel filters to include in the filterbank. Default is 20.
- center (bool) – Whether the kth frame of the signal should begin at index x[k * stride_len] (center = False) or be centered at x[k * stride_len] (center = True). Default is True.
- alpha (float in [0, 1)) – The coefficient for the preemphasis filter. A value of 0 corresponds to no filtering. Default is 0.95.
- fs (int) – The sample rate/frequency for the signal. Default is 44000.
mfcc
numpy_ml.preprocessing.dsp.mfcc(x, fs=44000, n_mfccs=13, alpha=0.95, center=True, n_filters=20, window='hann', normalize=True, lifter_coef=22, stride_duration=0.01, window_duration=0.025, replace_intercept=True)
Compute the Mel-frequency cepstral coefficients (MFCC) for a signal.
Notes
Computing MFCC features proceeds in the following stages:
- Convert the signal into overlapping frames and apply a window fn
- Compute the power spectrum at each frame
- Apply the mel filterbank to the power spectra to get mel filterbank powers
- Take the logarithm of the mel filterbank powers at each frame
- Take the discrete cosine transform (DCT) of the log filterbank energies and retain only the first k coefficients to further reduce the dimensionality
MFCCs were developed in the context of HMM-GMM automatic speech recognition (ASR) systems and can be used to provide a somewhat speaker/pitch invariant representation of phonemes.
Parameters:
- x (ndarray of shape (N,)) – A 1D signal consisting of N samples
- fs (int) – The sample rate/frequency for the signal. Default is 44000.
- n_mfccs (int) – The number of cepstral coefficients to return (including the intercept coefficient). Default is 13.
- alpha (float in [0, 1)) – The preemphasis coefficient. A value of 0 corresponds to no filtering. Default is 0.95.
- center (bool) – Whether the kth frame of the signal should begin at index x[k * stride_len] (center = False) or be centered at x[k * stride_len] (center = True). Default is True.
- n_filters (int) – The number of filters to include in the Mel filterbank. Default is 20.
- normalize (bool) – Whether to mean-normalize the MFCC values. Default is True.
- lifter_coef (int in \([0, +\infty]\)) – The cepstral filter coefficient. 0 corresponds to no filtering, larger values correspond to greater amounts of smoothing. Default is 22.
- window ({'hamming', 'hann', 'blackman_harris'}) – The windowing function to apply to the signal before taking the DFT. Default is 'hann'.
- stride_duration (float) – The duration of the hop between consecutive windows (in seconds). Default is 0.01.
- window_duration (float) – The duration of each frame / window (in seconds). Default is 0.025.
- replace_intercept (bool) – Replace the first MFCC coefficient (the intercept term) with the log of the total frame energy instead. Default is True.
Returns: mfccs (ndarray of shape (G, C)) – Matrix of Mel-frequency cepstral coefficients. Rows correspond to frames, columns to cepstral coefficients
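The five stages listed in the Notes can be strung together in a compact, dependency-free sketch. It is an illustration under simplifying assumptions rather than the library's pipeline: `mfcc_sketch` and its parameters (frame sizes given in samples, not seconds) are hypothetical, the HTK mel formula is used, and preemphasis, liftering, mean normalization and intercept replacement are all omitted.

```python
import cmath
import math


def mfcc_sketch(x, fs=8000, frame_width=64, stride=32, n_filters=8, n_mfccs=5):
    N, n_bins = frame_width, frame_width // 2 + 1

    # 1. Overlapping frames, each multiplied by a Hann window.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    n_frames = (len(x) - N) // stride + 1
    frames = [[x[i * stride + n] * window[n] for n in range(N)] for i in range(n_frames)]

    # 2. Power spectrum of a frame (non-negative frequencies only).
    def power(frame):
        return [
            abs(sum(v * cmath.exp(-2j * cmath.pi * k * n / N) for n, v in enumerate(frame))) ** 2
            for k in range(n_bins)
        ]

    # 3. Triangular filters whose edges are evenly spaced on the (HTK) mel scale.
    def to_mel(hz):
        return 2595 * math.log10(1 + hz / 700)

    def to_hz(mel):
        return 700 * (10 ** (mel / 2595) - 1)

    top = to_mel(fs / 2)
    edges = [to_hz(i * top / (n_filters + 1)) for i in range(n_filters + 2)]
    fbank = [
        [max(0.0, min((k * fs / N - lo) / (mid - lo), (hi - k * fs / N) / (hi - mid)))
         for k in range(n_bins)]
        for lo, mid, hi in zip(edges, edges[1:], edges[2:])
    ]

    mfccs = []
    for frame in frames:
        spec = power(frame)
        # 4. Log of the mel filterbank powers (small offset avoids log(0)).
        log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in fbank]
        # 5. DCT-II of the log energies, keeping the first n_mfccs coefficients.
        mfccs.append([
            2 * sum(e * math.cos(math.pi * k * (2 * m + 1) / (2 * n_filters))
                    for m, e in enumerate(log_e))
            for k in range(n_mfccs)
        ])
    return mfccs


tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
coeffs = mfcc_sketch(tone)
```

The result has one row per frame and one column per retained cepstral coefficient, mirroring the (G, C) layout described above.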
mel2hz
numpy_ml.preprocessing.dsp.mel2hz(mel, formula='htk')
Convert the mel-scale representation of a signal into Hz.
Parameters:
- mel (ndarray of shape (N, *)) – An array of mel frequencies to convert
- formula ({"htk", "slaney"}) – The Mel formula to use. "htk" uses the formula used by the Hidden Markov Model Toolkit, and described in O'Shaughnessy (1987). "slaney" uses the formula used in the MATLAB auditory toolbox (Slaney, 1998). Default is 'htk'.
Returns: hz (ndarray of shape (N, *)) – The frequencies of the items in mel, in Hz
hz2mel
numpy_ml.preprocessing.dsp.hz2mel(hz, formula='htk')
Convert the frequency representation of a signal in Hz into the mel scale.
Parameters:
- hz (ndarray of shape (N, *)) – An array of frequencies (in Hz) to convert
- formula ({"htk", "slaney"}) – The Mel formula to use. "htk" uses the formula used by the Hidden Markov Model Toolkit, and described in O'Shaughnessy (1987). "slaney" uses the formula used in the MATLAB auditory toolbox (Slaney, 1998). Default is 'htk'.
Returns: mel (ndarray of shape (N, *)) – The mel-scale values of the items in hz
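For reference, the widely used HTK-style conversion between Hz and mel is a logarithmic map and its inverse. The sketch below covers only that formula on scalar inputs; the function names are hypothetical and the "slaney" variant is not shown.

```python
import math


def hz2mel_htk(hz):
    """HTK-style mel scale: 2595 * log10(1 + hz / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel2hz_htk(mel):
    """Inverse of hz2mel_htk: 700 * (10 ** (mel / 2595) - 1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


round_trip = mel2hz_htk(hz2mel_htk(440.0))
```

The two functions are exact inverses, and the constants are chosen so that 1000 Hz maps to roughly 1000 mel.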
mel_filterbank
numpy_ml.preprocessing.dsp.mel_filterbank(N, n_filters=20, fs=44000, min_freq=0, max_freq=None, normalize=True)
Compute the filters in a Mel filterbank and return the corresponding transformation matrix.
Notes
The Mel scale is a perceptual scale designed to simulate the way the human ear works. Pitches judged by listeners to be equal in perceptual / psychological distance have equal distance on the Mel scale. Practically, this corresponds to a scale with higher resolution at low frequencies and lower resolution at higher (> 500 Hz) frequencies.
Each filter in the Mel filterbank is triangular with a response of 1 at its center and a linear decay on both sides until it reaches the center frequency of the next adjacent filter.
This implementation is based on code in the (superb) LibROSA package [1].
References
[1] McFee et al. (2015). "librosa: Audio and music signal analysis in Python", Proceedings of the 14th Python in Science Conference. https://librosa.github.io
Parameters:
- N (int) – The number of DFT bins
- n_filters (int) – The number of mel filters to include in the filterbank. Default is 20.
- min_freq (int) – Minimum filter frequency (in Hz). Default is 0.
- max_freq (int) – Maximum filter frequency (in Hz). Default is None.
- fs (int) – The sample rate/frequency for the signal. Default is 44000.
- normalize (bool) – If True, scale the Mel filter weights by their area in Mel space. Default is True.
Returns: fbank (ndarray of shape (n_filters, N // 2 + 1)) – The mel-filterbank transformation matrix. Rows correspond to filters, columns to DFT bins.
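The triangular shape described in the Notes can be sketched without any dependencies. This is an illustration, not the library's (librosa-derived) code: the function names are hypothetical, the HTK mel formula is used, a missing max_freq is assumed to mean the Nyquist frequency fs / 2, and the area normalization controlled by `normalize` is omitted.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def triangular_filterbank(N, n_filters=20, fs=44000, min_freq=0.0, max_freq=None):
    """Return n_filters rows of N // 2 + 1 triangular weights (unnormalized)."""
    # Assumption: a missing max_freq means the Nyquist frequency, fs / 2.
    max_freq = fs / 2 if max_freq is None else max_freq
    lo, hi = hz_to_mel(min_freq), hz_to_mel(max_freq)
    step = (hi - lo) / (n_filters + 1)
    # n_filters + 2 edge frequencies, evenly spaced on the mel scale.
    edges = [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]
    bin_hz = [k * fs / N for k in range(N // 2 + 1)]

    fbank = []
    for left, center, right in zip(edges, edges[1:], edges[2:]):
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)     # 0 at left edge, 1 at center
            falling = (right - f) / (right - center)  # 1 at center, 0 at right edge
            row.append(max(0.0, min(rising, falling)))
        fbank.append(row)
    return fbank


fbank = triangular_filterbank(N=512)
```

Each filter's right edge is the center of the next filter, so neighboring triangles overlap, and the filters widen in Hz as frequency increases because their edges are evenly spaced in mel rather than in Hz.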