Mel Frequency Cepstral Coefficient
It’s the year 2020! With everything that is taking place in my life, I reckon this will be a decisive year. May everyone have a wonderful year.
For a class project this semester, I encountered a feature extraction technique for audio signals, the Mel Frequency Cepstral Coefficient (MFCC), which proves to be very useful in many fields including speech recognition, music information retrieval, etc. In this article, I briefly introduce how the MFCC feature is computed.
The Mel Scale
Raw audio signals usually contain tens of thousands of samples per second. It is difficult for machine learning methods to operate on such a huge amount of data directly. Therefore, it is common to apply domain-specific feature extraction techniques to lower the dimensionality of audio data. A set of handcrafted features inspired by human hearing mechanisms has proved successful in representing audio signals in a low-dimensional space. The derivation of such features (i.e. acoustic features) requires knowledge of signal processing and human perception.
The human ear does not respond linearly to audio frequencies. Our hearing system is more sensitive to low-frequency sounds than to high-frequency ones. As a result, the actual difference in Hertz between two pitches at the same perceptual distance increases as the octave becomes higher. In 1937, Stevens et al.^{1} proposed a new frequency measure called the Mel scale, under which pitch distances are consistent with human perception. The MFCC feature adopts the Mel scale since it makes more sense for human-perception-related tasks. The conversion from Hertz to mels suggested by Zheng et al.^{2} is shown in Equation \ref{eqn:mel}. The figure below shows the curve of Equation \ref{eqn:mel}.
\begin{align}\label{eqn:mel} m = 2595\log_{10}\left( 1 + \frac{f}{700} \right). \end{align}
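For concreteness, Equation \ref{eqn:mel} and its inverse can be written as two small NumPy helpers (the function names here are my own):

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hertz to the Mel scale, m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse conversion from mels back to Hertz, f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

A quick sanity check: `hz_to_mel(1000.0)` is roughly 1000, reflecting the convention that 1000 Hz is close to 1000 mels.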
The computation of MFCC features
To compute MFCC, we need to perform the following steps^{3} ^{4}:

1. Pre-processing and Discrete Fourier Transform (DFT);
2. Mel-spaced filter bank construction;
3. Filter bank log energy computation;
4. Discrete Cosine Transform (DCT).

Preprocessing and Discrete Fourier Transform (DFT)
The MFCC feature is usually computed within a short time window (e.g. 10–30 ms), where the audio signal stays rather stationary. Denote the number of time-domain samples in each window by $N$, and the audio’s sample rate by $f_s$. If the time window is selected to be 20 ms, then $N = 0.02 f_s$; for instance, $N = 441$ at $f_s = 22050$ Hz. When the window slides across the audio signal, there can be overlaps between adjacent windows.
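The sliding-window step can be sketched as follows (the 22050 Hz sample rate, 20 ms window, and 10 ms hop are illustrative choices, not values from the text):

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Slice a 1-D signal into overlapping frames; the incomplete tail is dropped."""
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    # Build a (num_frames, frame_len) index matrix, one row per window
    idx = hop_len * np.arange(num_frames)[:, None] + np.arange(frame_len)[None, :]
    return signal[idx]

# One second of (silent) audio, 20 ms windows with a 10 ms hop
fs = 22050
frames = frame_signal(np.zeros(fs), int(0.02 * fs), int(0.01 * fs))
```

With these parameters each frame holds 441 samples and adjacent frames overlap by half a window.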
Now, given a time-domain audio signal $x[n]$ of length $N$, we perform the DFT to transform it into a frequency-domain signal $X[k]$, which is represented with complex numbers. The DFT formula gives
\begin{align} X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1, \end{align}
where “$j$” denotes the imaginary unit. Since the DFT of a real-valued signal is conjugate-symmetric, we are only interested in the first $\lfloor N/2 \rfloor + 1$ values in the DFT result.
In the frequency domain, the power spectrum of the signal writes
\begin{align} P[k] = \frac{1}{N} \left| X[k] \right|^2, \end{align}
where “$|\cdot|$” denotes the modulus of a complex number.
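As a sketch, the power spectrum of a single frame can be computed with NumPy’s real FFT (the helper name and the toy sine-wave frame are my own):

```python
import numpy as np

def power_spectrum(frame):
    """Periodogram estimate of one frame: |X[k]|^2 / N over the first N//2 + 1 bins."""
    n = len(frame)
    spectrum = np.fft.rfft(frame)        # complex DFT; rfft keeps the non-redundant half
    return (np.abs(spectrum) ** 2) / n   # squared modulus, scaled by 1/N

# Toy example: a 20 ms frame of a 1 kHz sine sampled at 22050 Hz
fs = 22050
t = np.arange(int(0.02 * fs)) / fs
frame = np.sin(2.0 * np.pi * 1000.0 * t)
p = power_spectrum(frame)
```

Here the frequency resolution is $f_s / N = 50$ Hz, so the spectral peak lands exactly in bin 20, i.e. at 1000 Hz.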

Mel-spaced Filter Bank
We have introduced the Mel scale, which is a frequency measure based on human psychology. In this step, we introduce a series of triangular filters (i.e. the filter bank) that are linearly spaced in the Mel scale to extract the signal’s characteristics at different frequencies. Equation \ref{eqn:mel} gives the transformation from the Hertz scale to the Mel scale. It is easy to see that its inverse transformation is
\begin{align} f = 700 \left( 10^{m/2595} - 1 \right). \end{align}
First of all, we need to define the start mel ($m_L$), the end mel ($m_H$) and the number of filters for the filter bank ($M$). It should be noted that the frequency in Hertz corresponding to $m_H$ must not exceed $f_s / 2$, as it is pointed out by the Nyquist–Shannon sampling theorem that the sampled audio cannot capture frequencies higher than $f_s / 2$ losslessly. In our example, the frequency corresponding to $m_H$ is approximately 9998 Hz (i.e. $m_H \approx 3073$), which is lower than half the sample rate. To create a filter bank, we need to find $M + 2$ linearly spaced points between $m_L$ and $m_H$, which form the mel array $(m_0, m_1, \ldots, m_{M+1})$ with
\begin{align} m_i = m_L + i\, \frac{m_H - m_L}{M + 1}, \quad i = 0, 1, \ldots, M + 1. \end{align}
Then, using the inverse of Equation \ref{eqn:mel}, we can convert the mel array into its corresponding Hertz-scale array (denoted by $(h_0, h_1, \ldots, h_{M+1})$), where $h_i = 700 \left( 10^{m_i / 2595} - 1 \right)$.
We need to find the DFT bins in $P$ that correspond to the frequencies in the Hertz array; these indices comprise the bin array $(b_0, b_1, \ldots, b_{M+1})$. Letting $h_i$ and $b_i$ be the $i$th elements of the Hertz and bin arrays, respectively, it can be seen that
\begin{align} b_i = \left\lfloor \frac{(N + 1)\, h_i}{f_s} \right\rfloor. \end{align}
Now, we define the $m$th filter (denoted by $H_m$) in the bank as follows:
\begin{align} H_m[k] = \begin{cases} 0, & k < b_{m-1}, \\ \dfrac{k - b_{m-1}}{b_m - b_{m-1}}, & b_{m-1} \le k \le b_m, \\ \dfrac{b_{m+1} - k}{b_{m+1} - b_m}, & b_m < k \le b_{m+1}, \\ 0, & k > b_{m+1}. \end{cases} \end{align}
The figure below shows the shape of triangular filters, where each unique color denotes a filter.
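The filter bank construction can be sketched in NumPy as follows (the number of filters, FFT size, sample rate, and mel range are illustrative choices, not values from the text):

```python
import numpy as np

def mel_filter_bank(num_filters, n_bins, sample_rate, low_mel, high_mel):
    """Build `num_filters` triangular filters, linearly spaced on the mel scale.

    `n_bins` is the number of retained DFT bins, i.e. N//2 + 1.
    """
    # M + 2 linearly spaced points on the mel scale, converted back to Hertz
    mels = np.linspace(low_mel, high_mel, num_filters + 2)
    hertz = 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
    # Map each Hertz value to a DFT bin index: b_i = floor((N + 1) * h_i / f_s)
    n = 2 * (n_bins - 1)
    bins = np.floor((n + 1) * hertz / sample_rate).astype(int)

    bank = np.zeros((num_filters, n_bins))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right + 1):   # falling edge; peak value 1 at the center
            bank[m - 1, k] = (right - k) / (right - center)
    return bank

# Illustrative parameters: 10 filters over 0 Hz .. 8 kHz for 22050 Hz audio,
# with a 512-point FFT (257 retained bins)
high_mel = 2595.0 * np.log10(1.0 + 8000.0 / 700.0)
bank = mel_filter_bank(10, 257, 22050, 0.0, high_mel)
```

Each row of `bank` is one triangular filter that rises from 0 to 1 and falls back to 0 over its neighbouring bin boundaries.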

Filter Bank Log Energy Computation
With each filter $H_m$ ($m = 1, \ldots, M$) in the bank, we compute the inner product of the power spectrum with the filter (a weighted sum rather than a true convolution) and take its logarithm. The $m$th log energy, denoted by $e_m$, is given by
\begin{align} e_m = \log \left( \sum_{k} P[k]\, H_m[k] \right). \end{align}
This step is essentially extracting the log audio energy around each perceptual pitch. Eventually, we can put the log energies of all filters into a whole sequence $\mathbf{e}$, where
\begin{align} \mathbf{e} = (e_1, e_2, \ldots, e_M). \end{align}
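In code, assuming NumPy arrays `power_spec` (the power spectrum) and `bank` (the filter bank) of compatible shape from the previous steps, the log energies reduce to a matrix-vector product (the `eps` guard is my own addition):

```python
import numpy as np

def log_filter_bank_energies(power_spec, bank, eps=1e-10):
    """Apply each triangular filter to the power spectrum and take logs.

    `power_spec` has shape (n_bins,); `bank` has shape (M, n_bins).
    `eps` guards against log(0) when a filter captures no energy.
    """
    energies = bank @ power_spec      # one weighted sum per filter
    return np.log(energies + eps)

# Toy example: a flat spectrum through a tiny two-filter bank
toy_bank = np.array([[0.0, 0.5, 1.0, 0.5, 0.0],
                     [0.0, 0.0, 0.5, 1.0, 0.5]])
e = log_filter_bank_energies(np.ones(5), toy_bank)
```

Each filter in the toy bank has weights summing to 2, so both log energies come out as roughly $\log 2$.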

Discrete Cosine Transform (DCT)
The MFCC feature can be considered a “spectrum of a spectrum”, for it is computed by applying the DCT to $\mathbf{e}$, which was itself derived from the DFT power spectrum. Finally, the $i$th MFCC feature of the audio signal is given by
\begin{align} c_i = \sum_{m=1}^{M} e_m \cos\left( \frac{i \left( m - \frac{1}{2} \right) \pi}{M} \right). \end{align}
Usually, $c_0$ is discarded because it is the DC component of the DCT transform, whose value is unstable.
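The DCT step can be implemented directly from the cosine sum above (a DCT-II up to scaling); keeping 13 coefficients is a common convention, not a value from the text:

```python
import numpy as np

def mfcc_from_log_energies(log_energies, num_ceps=13):
    """DCT of the log filter bank energies:
    c_i = sum_{m=1}^{M} e_m * cos(i * (m - 1/2) * pi / M).

    Keeps the first `num_ceps` coefficients; index 0 is the DC term,
    which is often discarded afterwards.
    """
    M = len(log_energies)
    m = np.arange(1, M + 1)                    # filter index, 1..M
    i = np.arange(num_ceps)[:, None]           # cepstral index, 0..num_ceps-1
    basis = np.cos(i * (m - 0.5) * np.pi / M)  # DCT-II cosine basis
    return basis @ np.asarray(log_energies)

# Toy check: constant log energies put all the energy into c_0
c = mfcc_from_log_energies(np.ones(26))
```

For the constant input, $c_0 = M = 26$ and all higher coefficients vanish, which is exactly why $c_0$ behaves like a DC component.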
References

S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the measurement of the psychological magnitude pitch”, The Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937. ↩

F. Zheng, G. Zhang, and Z. Song, “Comparison of different implementations of MFCC”, Journal of Computer Science and Technology, vol. 16, no. 6, pp. 582–589, 2001. ↩

J. Lyons, Mel frequency cepstral coefficient (MFCC) tutorial. [Online]. Available: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/. ↩

M. Sahidullah and G. Saha, “Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition”, Speech Communication, vol. 54, no. 4, pp. 543–565, 2012. ↩