The only exception is zeros and other very small values in the magnitude spectrum, which give negative infinities or arbitrarily large negative values in the log-spectrum. Though such values are "difficult" for visualizations, they are inconsequential for auditory perception and can often be ignored. For computations in the log-spectrum, however, arbitrarily large negative values are a problem. To reduce the likelihood of such problematic values, we can, for example, use an energy bias similar to the mu-law rule, or integrate energies over frequencies. Specifically, instead of $y=\log(|x|^2)$, we can use $y=\log(|x|^2+e)$, where $e$ is a small positive number. The output then never goes below the threshold $y\geq\log(e)$. In addition, we can integrate (or sum) the energies of neighboring frequencies before taking the logarithm.

The process of acquiring MFCCs from a spectrogram is illustrated on the right. At the top, there is a triangular filterbank whose filters are placed at linear steps on the mel-frequency scale. The second figure shows the spectrogram of a speech segment. When each window of that spectrogram is multiplied with the triangular filterbank, we obtain the mel-weighted spectrum, illustrated in the third figure. Here we see that the gross shape of the spectrogram is retained, but the fine structure has been smoothed out. In essence, this process removes the details related to the harmonic structure. Since the identity of phonemes such as vowels is determined by macro-shapes in the spectrum, the MFCCs preserve that type of information and remove "unrelated" information such as the pitch.

The fourth figure illustrates the outcome once the mel-weighted spectrogram is multiplied with a DCT to obtain the final MFCCs. Whereas the mel-weighted spectrogram retains the original shape of the spectrum, the MFCCs do not offer such an easy interpretation. They form an abstract domain, which contains information about the spectral envelope of the speech signal.

Though the argumentation for the MFCCs is not without problems, they have become the most widely used features in speech and audio recognition applications. They are used because they work, because they have relatively low complexity, and because they are straightforward to implement. If you are unsure which inputs to give to a speech and audio recognition engine, try the MFCCs first.

The beneficial properties of the MFCCs include:

- They quantify the gross shape of the spectrum (the spectral envelope), which is important in, for example, the identification of vowels. At the same time, they remove fine spectral structure (micro-level structure), which is often less important. They thus focus on the part of the signal which is typically most informative.
- Their calculation is straightforward and computationally reasonably efficient.
- Their performance is well-tested and well-understood.

Some of the issues with the MFCCs include:

- The choice of perceptual scale is not well-motivated. Scales such as the ERB scale or gammatone filterbanks might be better suited. However, these alternative filterbanks have not demonstrated consistent benefit, so the mel scale has persisted.
- The performance of MFCCs in the presence of additive noise, in comparison to other features, has not always been good.
- The choice of triangular weighting filters $w_{k,h}$ is arbitrary and not based on well-grounded motivations. Alternatives have been presented, but they have not gained popularity, probably because they have only a minor effect on the outcome.
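As a rough illustration of the steps described above (summing energies with a triangular mel filterbank, applying the energy bias inside the logarithm, and multiplying with a DCT), the following is a minimal NumPy sketch. Several details are assumptions that the text does not specify: the particular mel formula, the use of the power spectrum $|x|^2$ as input, applying the bias $e$ after the filterbank, the unnormalized DCT-II, and all function names and default parameter values, which are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One common mel formula (assumption: the text does not say which variant it uses).
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel above.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(n_filters, n_bins, sample_rate):
    """Weights w[k, h]: triangular filters centered at linear steps on the mel scale."""
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_filters + 2 edge points, equally spaced in mel, mapped back to Hz.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    w = np.zeros((n_filters, n_bins))
    for k in range(n_filters):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        # Triangle: minimum of the two slopes, clipped at zero outside [lo, hi].
        w[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return w

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis as an (n_out, n_in) matrix."""
    h = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    return np.cos(np.pi * k * (h + 0.5) / n_in)

def mfcc_from_power_spectrogram(power, sample_rate, n_filters=20, n_coeffs=13, e=1e-10):
    """power: array of shape (n_frames, n_bins) holding |x|^2 for each window."""
    w = triangular_filterbank(n_filters, power.shape[1], sample_rate)
    mel_energy = power @ w.T              # sum energies over neighboring frequencies
    log_mel = np.log(mel_energy + e)      # energy bias: output never goes below log(e)
    mfcc = log_mel @ dct_matrix(n_coeffs, n_filters).T
    return mfcc, log_mel

# Usage on a synthetic noise signal (a placeholder for real speech).
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
win, hop = 512, 256
frames = np.stack([signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)])
power = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1)) ** 2
mfcc, log_mel = mfcc_from_power_spectrogram(power, sample_rate=16000)
print(mfcc.shape, log_mel.min() >= np.log(1e-10))
```

Even for an all-zero input frame, `log_mel` stays finite at `log(e)`, which is the point of the energy bias. The number of filters and retained coefficients is a design choice; the values here are only illustrative.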