Mel frequency cepstral coefficients

Wow, that's a lotta data compression

Mel frequency cepstral coefficients have always fascinated me. They are incredibly abstruse. I’ve never found any mathematical intuitions for why they work. Yet they’re effective!

MFCCs were originally developed in the 1970s, when their use was primarily geared toward digital speech recognition. They have persisted through several generations of machine intelligence development, from early work in genre classification (G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, July 2002) to modern deep learning networks.

A useful characteristic of MFCCs is that they are an example of dimensionality reduction, specific to the audio domain. Raw audio data comes in the form of a long series of sample values, corresponding to instantaneous pressure levels of a sound wave traveling through air. Faithfully representing even a few seconds of audio requires hundreds of thousands of sample values.

It's impossible to look at any one sample level and make a determination about the underlying signal. Occasionally, for simple signals, you can examine a small group of samples and make a determination about their frequency or rhythm. But for a waveform capturing any non-trivial auditory scene, it's not really possible to look directly at the values of the waveform and ask questions like, “is this a bird chirping, or a trumpet sounding, or two people talking?”

Therefore, audio waveforms are described as having high dimensionality: there are hundreds of thousands, millions, or more sample levels, all of which can take on values independent of one another. High-dimensionality data is a problem for analysis because it's difficult or impossible to draw conclusions from it; any single value does not say much about what's going on in the data.

Much of modern deep learning works by taking high-dimensionality data and reducing it down to a latent space, where it can be represented using a smaller set of values that are still descriptive of some characteristics of the original data. MFCCs accomplish this without any deep learning, reducing the data down to just 10–20 values that still retain useful descriptive power about the underlying signal.

Definition of MFCCs

MFCCs are computed as follows:

  1. Compute the spectrum of a signal via discrete Fourier transform (DFT).
  2. For each frequency component of the Fourier transform, square it to get the power spectrum.
  3. Map the power spectrum onto the mel scale using triangular overlapping windows.
  4. Take the logs of the powers at each mel frequency.
  5. Take the discrete cosine transform (DCT) of the log powers.
  6. The MFCCs are the real amplitudes of the resulting spectrum.

An intuitive way of summarizing this procedure is that we are taking the spectrum (frequency content) of a signal, reducing the number of frequencies represented in a way that better matches human perception, and then computing the frequency trends across that space.
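The six steps above can be sketched end to end in a few lines of NumPy. This is a simplified illustration under stated assumptions, not a production implementation: the function and parameter names are mine, and the framing, windowing, and filter normalization that a real library such as librosa performs are omitted.

```python
import numpy as np

def mfcc_sketch(signal, sample_rate, n_mels=20, n_coeffs=13):
    """Toy MFCC pipeline following the six steps above (no framing/windowing)."""
    # Steps 1-2: power spectrum via the DFT (rfft keeps non-negative frequencies).
    spectrum = np.fft.rfft(signal)
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

    # Step 3: triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    f_pts = inv_mel(np.linspace(0, mel(sample_rate / 2), n_mels + 2))
    mel_energies = np.zeros(n_mels)
    for j in range(1, n_mels + 1):
        lo, mid, hi = f_pts[j - 1], f_pts[j], f_pts[j + 1]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        weights = np.clip(np.minimum(rising, falling), 0, 1)
        mel_energies[j - 1] = np.sum(weights * power)

    # Steps 4-6: log of the mel energies, then the DCT; keep the first few values.
    log_m = np.log(mel_energies + 1e-10)  # small offset guards against log(0)
    k = np.arange(1, n_mels + 1)
    return np.array([np.sum(log_m * np.cos(n * (k - 0.5) * np.pi / n_mels))
                     for n in range(n_coeffs)])
```

Each step is unpacked in the sections that follow.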

Let's examine each step in more detail.

Discrete Fourier transform

Like so many audio analysis tasks, our journey begins with the discrete Fourier transform, or DFT. The discrete Fourier transform is the alpha and the omega of digital signal analysis. (It's called the “discrete” Fourier transform because it deals with signals whose levels are sampled at discrete points in time. This is a requirement of digital systems, which have finite memory and thus segment a conceptually continuous audio signal into a series of sample levels. There is a continuous Fourier transform dealing with continuous or analog signals, normally called simply the Fourier transform, without additional qualification, as it was formulated first.) The DFT is defined as:

$$X(f) = \displaystyle\sum_{n = 0}^{N-1} x(n)\, e^{\frac{-i 2\pi f n T}{N}}$$

where $x(n)$ is the $n$-th sample of the signal, $N$ is the total number of samples, $f$ is the frequency of interest, $T$ is the sampling period, and $i$ is the imaginary unit.

You may have also heard people speak of the “FFT” (fast Fourier transform). The FFT refers to a specific implementation of the discrete Fourier transform which is much faster than the naive implementation. For whatever reason, “FFT” has become the common parlance for what is actually the DFT. While technically imprecise, this isn't inherently problematic, but you should know that when someone says “FFT” they usually mean the DFT defined above.

Effectively, it's multiplying the input signal by a complex sinusoid (a complex sinusoid is any function of the form $A e^{i(2\pi f t + \phi)}$, which is equivalent to $A \cos(2\pi f t + \phi) + i A \sin(2\pi f t + \phi)$; such a function is characterized by its frequency $f$ and phase shift $\phi$) at each frequency and summing the resulting set of values. This can be thought of as a mathematical projection from the time domain (where the signal can be measured in the first place, and later auditioned) to the frequency domain (sometimes called the spectral domain).2
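As a sketch, this sum can be evaluated directly in NumPy. Here $f$ is taken as an integer bin index, so the $fnT/N$ factor in the formula reduces to $fn/N$; this naive version is $O(N^2)$ and purely illustrative, unlike the $O(N \log N)$ FFT.

```python
import numpy as np

def naive_dft(x):
    """Direct evaluation of the DFT sum, one complex sinusoid per frequency bin."""
    N = len(x)
    n = np.arange(N)
    # For each bin f: multiply the signal by e^{-2*pi*i*f*n/N} and sum.
    return np.array([np.sum(x * np.exp(-2j * np.pi * f * n / N))
                     for f in range(N)])
```

The result agrees with `np.fft.fft`, which computes the same transform far more efficiently.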

The effect of squaring the signal in step 2 above is to convert it to signal power, which has some relationship to perceptual and physical properties. (However, taking the $\log$ later in the process means the squaring is fairly inconsequential, since $\log(X^2) = 2 \log(X)$. The end result after applying the DCT is that all of the MFCCs are just scaled up by 2, so there's no difference in their relative proportions whether you square or not. Evidently, power is still used for historical reasons and by convention.)

The mel scale

A typical DFT implementation will provide complex amplitudes1 for a set of frequencies evenly distributed from 0 Hz up to $\frac{1}{2}$ of the sampling frequency (known as the Nyquist frequency). For example, for a DFT of a signal of length 2000 and a sample rate of 48000 Hz, you get an amplitude for $f_k = \{0, 24, 48, ..., 23952, 23976\}$ Hz.
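To make the spacing concrete: the bins sit at multiples of $f_s/N$, so a 24 Hz spacing, as in the list of frequencies here, corresponds to $N = 2000$ samples at 48 kHz. A quick NumPy check:

```python
import numpy as np

fs, N = 48000, 2000
# fftfreq returns bin frequencies at multiples of fs/N; keep the non-negative half.
freqs = np.fft.fftfreq(N, d=1.0 / fs)[: N // 2]
# Bins: 0, 24, 48, ..., 23952, 23976 (just below the 24000 Hz Nyquist frequency)
```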

While this is an eminently logical way to arrange frequencies, it does mean that just as much data is used to represent frequencies between 2000–2200 Hz as 200–400 Hz. Why does that matter? The human hearing system is less attuned to small differences at higher frequencies, so many of the amplitudes at higher frequencies model frequencies that, to human ears, aren't perceptibly different from their neighbors. Conversely, our hearing system can better distinguish two different frequencies if they are in a lower range, typically less than 1000 Hz.

From an analytical perspective, we may as well combine some of these higher frequencies into a single value. Collapsing the amplitudes of higher frequencies into a smaller set of values helps in the task of dimensionality reduction, which we previously identified as a major benefit of audio analysis using MFCCs.

MFCCs accomplish this using something called the mel scale. The name “mel” was chosen simply as a shortening of “melody”. The mel scale was designed to model frequency relationships in a way that is more meaningful to how frequencies are actually heard by the human auditory system.

Over the decades, several formulae have been proposed and regularly used to convert a frequency in Hz into the corresponding mel frequency (conveniently denominated in units of mels). The conventional formula is:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

This arbitrary-looking equation was devised to approximate empirical data on “just noticeable differences” between two frequencies, compiled from psychoacoustic experiments in the early 20th century in which subjects were asked to listen to pure tones at different frequencies and state whether they heard a difference.

The inverse transformation is:

$$f = 700\left(10^{m/2595} - 1\right)$$
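These two mappings are easy to sketch (function names are mine). A handy sanity check: the mel scale was anchored so that 1000 Hz corresponds to roughly 1000 mels.

```python
import numpy as np

def hz_to_mel(f):
    """Conventional Hz -> mel mapping."""
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    """Inverse mapping, mel -> Hz."""
    return 700 * (10 ** (m / 2595) - 1)
```

The two functions are exact inverses of one another, up to floating-point error.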

Applying the mel scale

At a high level, we apply the mel scale by taking a set of amplitudes corresponding to linear frequencies (the result of the DFT) and arriving at a set of amplitudes at specific mel frequencies. We accomplish this using a window function.

A window function selects a limited range of values from some other function, zeroing out everything outside the window. Window functions are oriented around a central middle value, and most window functions attenuate values on either side of this middle value before zeroing out the values that are too far from the middle.

The figure above shows an example of a triangular window function. The dashed gray line represents an original signal (here, a simple sine wave). The red triangle is the window function, which peaks at the center frequency and linearly slopes to zero on either side. The blue line shows the result of multiplying the signal by the window: values near the center are preserved while those at the edges are attenuated.

When converting the DFT to the mel scale, we sum all of the values left over from the window function to arrive at a single value for the mel frequency at the center of the window function. According to the definition presented above, each triangular window is centered at the mel frequency of interest and extends on either side to the previous and next mel frequency.

The plot above shows an example of arranging triangular filter banks along the mel scale. Each triangle is centered at a mel frequency, but because the mel scale is logarithmic with respect to linear frequency, the filters become progressively wider at higher frequencies. This reflects the fact that human hearing has finer frequency resolution at lower frequencies. The energy under each window is accumulated to represent the singular result for the mel frequency at the center of the corresponding window.

We can compute this as follows. Fundamentally, the window function decides how much of the signal level at frequency $f$ should contribute to the level of the mel frequency denoted by $k$. Call this quantity $H_k(f)$; it defines the fraction of the power spectrum at frequency $f$ to include in the window. This should be 1 at the center of the window, 0 past the bounds of the window, and fall off linearly from 1 to 0 between the center and either extreme.

$$H_k(f) = \begin{cases} \frac{f - f_{k-1}}{f_k - f_{k-1}} & f_{k-1} \leq f < f_k \\ \frac{f_{k+1} - f}{f_{k+1} - f_k} & f_k \leq f < f_{k+1} \\ 0 & \text{otherwise} \end{cases}$$

Here, $k$ corresponds to the $k$-th mel frequency and $f_k$ is the conventional frequency corresponding to the $k$-th mel frequency (computed using the formula from the previous section).

Next we apply the window for each mel frequency of interest, and sum all of the windowed values to get the power for that mel frequency.

$$M(k) = \displaystyle\sum_{f} H_k(f) \cdot |X(f)|^2$$
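A sketch of $H_k(f)$ and the windowed sum $M(k)$ in NumPy (function names are mine; a vectorized implementation would build all the filters as one matrix instead):

```python
import numpy as np

def tri_window(f, f_lo, f_center, f_hi):
    """H_k(f): triangular weight, 1 at the center, 0 outside [f_lo, f_hi]."""
    f = np.asarray(f, dtype=float)
    rising = (f - f_lo) / (f_center - f_lo)     # ramp up to the center
    falling = (f_hi - f) / (f_hi - f_center)    # ramp back down past it
    return np.clip(np.minimum(rising, falling), 0, 1)

def mel_power(power, freqs, f_lo, f_center, f_hi):
    """M(k): windowed sum of the power spectrum for one mel bin."""
    return np.sum(tri_window(freqs, f_lo, f_center, f_hi) * power)
```

For a window centered at 100 Hz spanning 0–200 Hz, a frequency at 50 Hz contributes half its power, the center contributes all of it, and the edges contribute nothing.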

Lastly, how do we determine which mel frequencies to use (the values indexed by $k$ here)? Generally, you select a minimum and maximum conventional frequency, convert these to mel frequencies, divide the space between them equally among the desired number of mel frequencies to represent (also referred to as mel bins), and then convert these back to conventional frequencies.

For instance, for a minimum frequency of 0 Hz and maximum of 22050 Hz, and 128 mel bins, we get the following conventional frequencies:

$$f_k = \{0, 19.45, 39.45, 60, 81.12, 102.83, 125.14, 148.07, 171.64, 195.86, ..., 22050\}$$
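These bin frequencies can be reproduced with a short helper (the name is mine): points spaced equally on the mel scale between the converted endpoints, then mapped back to Hz. Note that with the endpoints included, 128 points span 0 to 22050 Hz.

```python
import numpy as np

def mel_bin_freqs(f_min, f_max, n_bins):
    """Equally spaced points on the mel scale, converted back to Hz."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    return inv(np.linspace(mel(f_min), mel(f_max), n_bins))

freqs = mel_bin_freqs(0, 22050, 128)
# Nearby bins start ~20 Hz apart and the spacing grows toward 22050 Hz.
```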

The DCT

The final step uses the discrete cosine transform (DCT). The $n$-th MFCC coefficient $c_n$ is given by:

$$c_n = \displaystyle\sum_{k = 1}^{K} \log M(k) \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right]$$

where $M(k)$ is the output of the $k$-th mel filter bank from the previous section (taking its log is step 4 of the procedure) and $K$ is the total number of mel bins.
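A minimal sketch of this step (the helper name is mine, and the log energies are assumed to be computed already per step 4):

```python
import numpy as np

def dct_coeffs(log_m, n_coeffs):
    """c_n per the formula above: a DCT (type II) of the log filter bank energies."""
    K = len(log_m)
    k = np.arange(1, K + 1)
    return np.array([np.sum(log_m * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(n_coeffs)])
```

One sanity check: for a flat input, all the information sits in $c_0$ (a constant has no variation across the mel bins), and every higher coefficient sums to zero.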

The DCT is a cousin of the DFT, with some key differences. One is obviously that it only uses the (real) cosine function, whereas the DFT uses a complex exponential involving both sine and cosine. As a result, the DCT can be calculated without the use of complex arithmetic. Another key difference is that, in this version of the DCT (there are actually eight different variants of the DCT, of which four are evidently used with much regularity), the cosine function is offset by half a sample, which has to do with the symmetry implied in the underlying signal.

My understanding is that the DCT is used because it concentrates more of the signal's information in the lower coefficients than the standard DFT does. (It's used widely in image compression for this reason.) As a consequence of this property, only the first dozen or so coefficients are used, as they carry enough information to make useful predictions about the original signal.

Example MFCCs

Here are MFCCs computed from two different audio sources: a spoken word and a clip of pop music. The MFCC values were calculated using librosa's `mfcc` function. Both samples were clipped to a 16384-sample window to produce a single frame of 20 MFCC values; typically a smaller frame size is used, with frames spaced out over time to capture time-varying changes in the MFCC values as well.

Speech example (“one”): From the AudioMNIST dataset, S. Becker et al., “AudioMNIST: Exploring Explainable Artificial Intelligence for audio analysis on a simple benchmark,” Journal of the Franklin Institute, 2023.

Pop music excerpt: From the GTZAN dataset, G. Tzanetakis and P. Cook, “Musical Genre Classification of Audio Signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.

Cursorily, it's easy to see that the two classes produce pretty distinct sets of values. However, you would need to examine a larger set of examples of each to understand the commonalities and distinctions between the MFCCs of either audio class.

Further reading


  1. Complex amplitude means a value representing both “real” or scalar amplitude, which is roughly perceived as loudness, and phase, which represents a shift of the signal in time but has no specific perceptual correlation. It is complex in the sense that it is mathematically modeled using a complex number, i.e. $x + iy$.

  2. In fact, in the digital domain, it is exactly a mathematical projection $X = x \cdot M$, where the rows of $M$ are simply $e^{-2\pi i f t}$ in discretized form.