Mel Frequency Cepstral Coefficient
01 Jan 2020

It’s the year 2020! With all the things taking place in my life, I reckon this will be a decisive year for me. May everyone have a wonderful year.
For a class project I did this semester, I encountered a feature extraction technique for audio signals: the Mel Frequency Cepstral Coefficient (MFCC), which proves very useful in many fields, including speech recognition and music information retrieval. In this article, I briefly introduce how the MFCC feature is computed.
The Mel Scale
Raw audio signals usually contain tens of thousands of samples per second. It is difficult for machine learning methods to operate on such a huge amount of data directly. Therefore, it is common to apply domain-specific feature extraction techniques to lower the dimensionality of audio data. A set of handcrafted features inspired by human hearing mechanisms has proved successful in representing audio signals in a low-dimensional space. The derivation of such features (i.e. acoustic features) requires knowledge of signal processing and human perception.
The human ear does not respond linearly to audio frequencies. Our hearing system is more sensitive to low-frequency sounds than to high-frequency ones. As a result, the actual difference in Hertz between two pitches at the same perceptual distance (e.g. \(\mathrm{C}_4\)–\(\mathrm{D}_4\) versus \(\mathrm{C}_7\)–\(\mathrm{D}_7\)) increases as the octave becomes higher. In 1937, Stevens et al.^{1} proposed a new frequency measure called the Mel scale, under which pitch distances are consistent with human perception. The MFCC feature adopts the Mel scale since it makes more sense for tasks related to human perception. The conversion from \(f\) hertz to \(m\) mels suggested by Zheng et al.^{2} is shown in Equation \ref{eqn:mel}. The figure below shows the curve of Equation \ref{eqn:mel}.
\begin{align}\label{eqn:mel} m = 2595\log_{10}\left( 1 + \frac{f}{700} \right). \end{align}
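As a quick sanity check, the conversion and its inverse can be written in a few lines (a sketch using NumPy; the function names are mine):

```python
import numpy as np

def hz_to_mel(f):
    """Hertz to mels, following the formula above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Mels back to Hertz (the inverse transformation)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000.0), 1))   # 1000 Hz sits very close to 1000 mels
```

With the constants 2595 and 700, 1000 Hz maps to roughly 1000 mels, which is one common way this formula is calibrated.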
The computation of MFCC features
To compute MFCC features, we need to perform the following steps^{3} ^{4}:

1. Preprocessing and Discrete Fourier Transform (DFT);
2. Mel-spaced filter bank construction;
3. Filter bank log energy computation;
4. Discrete Cosine Transform (DCT).

Preprocessing and Discrete Fourier Transform (DFT)
\(\newcommand{\nicefrac}[2]{#1/#2}\) The MFCC feature is usually computed within a short time window (e.g. 10–30 ms), where the audio signal remains roughly stationary. Denote the number of time-domain samples in each window by \(N\), and the audio’s sample rate by \(r\). Suppose \(r = 22050~\mbox{Hz}\) and the time window is selected to be 20 ms; then \(N = 22050 \times \nicefrac{20}{1000} = 441\). As the window slides across the audio signal, there can be overlaps between adjacent windows.
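The windowing arithmetic above can be sketched as follows (NumPy assumed; the 50% hop size is an illustrative choice, not prescribed by the text):

```python
import numpy as np

# Parameters from the example: 22050 Hz sample rate, 20 ms windows.
r = 22050
N = r * 20 // 1000               # 441 samples per window
hop = N // 2                     # 50% overlap, an illustrative choice

signal = np.random.randn(r)      # one second of dummy audio
n_frames = 1 + (len(signal) - N) // hop
frames = np.stack([signal[i * hop : i * hop + N] for i in range(n_frames)])
print(frames.shape)              # (99, 441)
```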
\(\newcommand{\imi}{\mathrm{i}}\) \(\newcommand{\imj}{\mathrm{j}}\) Now, given a time-domain audio signal \(s(n)\) of length \(N\), we perform DFT to transform it into a frequency-domain signal \(S(k)\), which is represented with complex numbers. The DFT formula gives
\[\begin{align} S(k) = \sum_{n=1}^{N} s(n) \cdot e^{-\frac{\imi 2\pi}{N}k(n-1)} \quad k \in [1, \lfloor\nicefrac{N}{2}\rfloor], \end{align}\]where “\(\imi\)” denotes the imaginary unit. Since the DFT of a real-valued signal is conjugate-symmetric, we are only interested in the first \(\lfloor\nicefrac{N}{2}\rfloor\) values of the DFT result.
In the frequency domain, the power spectrum of the signal, \(P(k)\), writes
\[\begin{align} P(k) = \frac{1}{N} \left\lvert S(k) \right\rvert^2 \quad k \in [1, \lfloor\nicefrac{N}{2}\rfloor], \end{align}\]where “\(\lvert \cdot \rvert\)” denotes the modulus of a complex number.
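Both steps can be sketched with NumPy's real-input FFT (a sketch; note that `np.fft.rfft` returns one extra bin compared with the text, since it also includes the DC bin):

```python
import numpy as np

def power_spectrum(s):
    """Power spectrum P(k) of one windowed frame s(n).

    np.fft.rfft returns the non-redundant half of the DFT of a
    real-valued signal: floor(N/2) + 1 bins, including the DC bin."""
    N = len(s)
    S = np.fft.rfft(s)               # frequency-domain signal S(k)
    return np.abs(S) ** 2 / N        # P(k) = |S(k)|^2 / N

P = power_spectrum(np.random.randn(441))
print(P.shape)                       # (221,)
```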

Mel-spaced Filter Bank
We have introduced the Mel scale, a frequency measure based on human psychology. In this step, we introduce a series of triangular filters (i.e. the filter bank) that are linearly spaced on the Mel scale to extract the signal’s characteristics at different frequencies. Equation \ref{eqn:mel} gives the transformation from the Hertz scale to the Mel scale. It is easy to see that its inverse transformation is
\[\begin{align}\label{eqn:melinv} f = 700\left( 10^{\frac{m}{2595}} - 1 \right). \end{align}\]First of all, we need to define the start mel (\(m_1\)), the end mel (\(m_2\)) and the number of filters in the bank (\(b\)). Note that the frequency in Hertz corresponding to \(m_2\) must not exceed \(\lfloor \nicefrac{r}{2} \rfloor\): by the Nyquist–Shannon sampling theorem, the sampled audio cannot capture frequencies higher than \(\nicefrac{r}{2}\). As an example, we choose \(m_1 = 150\), \(m_2 = 3073\) and \(b = 10\); the frequency corresponding to \(m_2\) is approximately 9998 Hz, which is below half the sample rate (11025 Hz). To create a filter bank, we need to find \(b + 2\) linearly spaced points between \(m_1\) and \(m_2\), which form the \(\mathtt{mel}\) array. In this case, it can be seen that
\[\begin{multline*} \mathtt{mel} = \{150.00,415.73,681.45,947.18,1212.91,1478.64, \\ 1744.36,2010.09,2275.82,2541.55,2807.27,3073.00\}. \end{multline*}\]Then using Equation \(\ref{eqn:melinv}\), we can convert \(\mathtt{mel}\) into its corresponding Hertz scale array (denoted by \(\mathtt{hertz}\)), which is given by
\[\begin{multline*} \mathtt{hertz} = \{99.65,312.28,581.45,922.19,1353.53,1899.56, \\ 2590.79,3465.81,4573.50,5975.73,7750.82,9997.90\}. \end{multline*}\]We need to find the DFT bins in \(S(k)\) that correspond to the frequencies in the \(\mathtt{hertz}\) array; these bin indices comprise the \(\mathtt{bin}\) array. Let \(\mathtt{hertz}(i)\) and \(\mathtt{bin}(i)\) be the \(i\)th elements of the \(\mathtt{hertz}\) and \(\mathtt{bin}\) arrays, respectively. It can be seen that
\[\begin{align} \mathtt{bin}(i)=\left\lfloor \frac{N \cdot \mathtt{hertz}(i)}{r} \right\rfloor. \end{align}\]In our example, the \(\mathtt{bin}\) array is given by
\[\begin{align*} \mathtt{bin}=\{1,6,11,18,27,37,51,69,91,119,155,199\}. \end{align*}\]Now, we define the \(i\)th filter (denoted by \(\psi_i\)) in the bank as follows.
\[\begin{align} \psi_i(k) = \begin{cases} \frac{k - \mathtt{bin}(i)}{\mathtt{bin}(i+1)-\mathtt{bin}(i)} & \mbox{if}~\mathtt{bin}(i) \leq k < \mathtt{bin}(i+1),\\ 1-\frac{k - \mathtt{bin}(i+1)}{\mathtt{bin}(i+2)-\mathtt{bin}(i+1)} & \mbox{if}~\mathtt{bin}(i+1) \leq k < \mathtt{bin}(i+2),\\ 0 &\mbox{otherwise}. \end{cases} \end{align}\]The figure below shows the shape of the triangular filters, where each unique color denotes a filter.

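The whole filter-bank construction above can be sketched as follows (NumPy assumed; variable names follow the text, but indices are 0-based in code):

```python
import numpy as np

def mel_filter_bank(m1, m2, b, N, r):
    """Build b triangular filters linearly spaced on the Mel scale."""
    mel = np.linspace(m1, m2, b + 2)                    # b + 2 mel points
    hertz = 700.0 * (10.0 ** (mel / 2595.0) - 1.0)      # back to Hertz
    bins = np.floor(N * hertz / r).astype(int)          # DFT bin indices
    psi = np.zeros((b, N // 2 + 1))
    for i in range(b):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):                        # rising edge
            psi[i, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):                        # falling edge
            psi[i, k] = 1.0 - (k - mid) / (hi - mid)
    return psi, bins

psi, bins = mel_filter_bank(150, 3073, 10, 441, 22050)
print(bins)   # matches the bin array in the text
```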
Filter Bank Log Energy Computation
With each filter \(\psi_i\) in the bank (\(i=1, \ldots, b\)), we compute the weighted sum of the power spectrum \(P(k)\) under that filter and take its logarithm. The \(i\)th log energy, denoted by \(\theta_i\), is given by
\[\begin{align} \theta_i = \ln \left[\sum_{k=1}^{\lfloor\nicefrac{N}{2}\rfloor} P(k)\cdot\psi_i(k)\right] \quad i=1, 2, \ldots, b. \end{align}\]This step essentially extracts the log audio energies around each perceptual pitch. Eventually, we can collect the log energies of all filters into a single sequence \(\phi\), where
\[\begin{align} \phi = \{ \theta_1, \theta_2, \ldots, \theta_b \}. \end{align}\] 
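This step can be sketched as a single matrix-vector product followed by a logarithm (NumPy assumed; the spectrum and filters are stubbed with random positive values so the snippet runs on its own):

```python
import numpy as np

# Stubbed inputs: P stands for the power spectrum of one frame and psi for
# the (b, floor(N/2) + 1) filter-bank matrix built in the previous step.
rng = np.random.default_rng(0)
P = rng.random(221)               # power spectrum bins for N = 441
psi = rng.random((10, 221))       # b = 10 triangular filters

theta = np.log(psi @ P)           # log energy captured by each filter
phi = theta                       # the sequence of b log energies
print(phi.shape)                  # (10,)
```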
Discrete Cosine Transform (DCT)
The MFCC feature can be considered a “spectrum of a spectrum”, for it is computed by applying the DCT to \(\phi\), which was itself derived from the DFT power spectrum. Finally, the \(i\)th MFCC feature of the audio signal \(s(n)\) is given by
\[\begin{align} \mathrm{MFCC}(i) = \sum_{k=1}^{b} \theta_k \cos \left[ \frac{\pi}{b} \left(k - \frac{1}{2}\right) (i-1) \right] \quad i=1, 2, \ldots, b. \end{align}\]Usually, \(\mathrm{MFCC}(1)\) is discarded because it is the DC component of the DCT transform, whose value is unstable.
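A sketch of this final DCT step (NumPy assumed; indices are 0-based in code, so entry `i` of the result corresponds to \(\mathrm{MFCC}(i+1)\) in the text):

```python
import numpy as np

def mfcc_from_log_energies(phi):
    """DCT of the log filter-bank energies.

    With 0-based k, the cosine argument pi/b * (k + 1/2) * i matches the
    1-based formula above; entry 0 is the DC component MFCC(1)."""
    b = len(phi)
    k = np.arange(b)
    return np.array([np.sum(phi * np.cos(np.pi / b * (k + 0.5) * i))
                     for i in range(b)])

log_energies = np.log(np.arange(1.0, 11.0))      # dummy phi with b = 10
mfcc = mfcc_from_log_energies(log_energies)[1:]  # drop the DC term MFCC(1)
print(mfcc.shape)                                # (9,)
```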
References

S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the measurement of the psychological magnitude pitch”, The Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937. ↩

F. Zheng, G. Zhang, and Z. Song, “Comparison of different implementations of MFCC”, Journal of Computer science and Technology, vol. 16, no. 6, pp. 582–589, 2001. ↩

J. Lyons, Mel frequency cepstral coefficient (MFCC) tutorial. [Online]. Available: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/. ↩

M. Sahidullah and G. Saha, “Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition”, Speech Communication, vol. 54, no. 4, pp. 543–565, 2012. ↩