IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 6, AUGUST 2011 1791
An Auditory-Based Feature Extraction Algorithm
for Robust Speaker Identification Under
Mismatched Conditions
Qi Li, Senior Member, IEEE, and Yan Huang, Member, IEEE
Abstract—An auditory-based feature extraction algorithm is
presented. We name the new features cochlear filter cepstral
coefficients (CFCCs); they are defined based on a recently de-
veloped auditory transform (AT) plus a set of modules that emulate
the signal processing functions in the cochlea. The CFCC features
are applied to a speaker identification task to address the acoustic
mismatch problem between training and testing environments.
Usually, the performance of acoustic models trained in clean
speech drops significantly when tested in noisy speech. The CFCC
features have shown strong robustness in this kind of situation. In
our experiments, the CFCC features consistently perform better
than the baseline MFCC features under all three mismatched
testing conditions—white noise, car noise, and babble noise. For
example, in clean conditions, both MFCC and CFCC features
perform similarly, over 96%, but when the signal-to-noise ratio
(SNR) of the input signal is 6 dB, the accuracy of the MFCC
features drops to 41.2%, while the CFCC features still achieve an
accuracy of 88.3%. The proposed CFCC features also compare
favorably to perceptual linear predictive (PLP) and RASTA-PLP
features. The CFCC features consistently perform much better
than PLP. Under white noise, the CFCC features are significantly
better than RASTA-PLP, while under car and babble noise, the
CFCC features provide similar performances to RASTA-PLP.
Index Terms—Auditory-based features, automatic speaker
recognition (ASR), cochlea, feature extraction algorithm, robust
speaker recognition, speaker identification.
I. INTRODUCTION
FEATURE extraction is the first crucial component in automatic speech processing. Generally speaking, successful
front-end features should carry enough discriminative informa-
tion for classification or recognition, fit well with the back-end
modeling, and be robust with respect to the changes of acoustic
environments. To the best of our knowledge, obtaining a sat-
isfactory system performance under various operating modes
still remains problematic, especially when acoustic training and
testing environments are mismatched. Since the human hearing
Manuscript received July 29, 2009; revised January 06, 2010; accepted De-
cember 09, 2010. Date of publication December 23, 2010; date of current ver-
sion June 01, 2011. This work was supported by the U.S. AFRL under the Con-
tract FA8750-08-C0028. The associate editor coordinating the review of this
manuscript and approving it for publication was Dr. Richard C. Rose.
Q. Li is with Li Creative Technologies, Inc., Florham Park, NJ 07932 USA
(e-mail: qili@ieee.org).
Y. Huang was with Li Creative Technologies, Inc., Florham Park, NJ 07932
USA. She is now with Microsoft Corporation, Redmond, WA 98052 USA
(e-mail: yanhuang@ieee.org).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2010.2101594
system is robust to the mismatched conditions, we propose an
auditory-based feature extraction algorithm that is modeled on
the basic signal processing functions in the ear. The proposed al-
gorithm is also based on our recently developed auditory-based
time–frequency transform named Auditory Transform (AT) [1],
[2]. The features generated from the proposed algorithm are
named cochlear filter cepstral coefficients (CFCCs).
A. Traditional Speech Feature Extraction and the Fourier
Analysis
At a high level, most speech feature extraction methods fall
into the following two categories: modeling the human voice
production system or modeling the peripheral auditory system.
For the first approach, one of the most popular features is
a group of cepstral coefficients derived from linear prediction
known as the linear prediction cepstral coefficients (LPCCs)
[3], [4]. LPCC feature extraction utilizes an all-pole filter
to model the human vocal tract, with the speech formants captured
by the poles of the filter. The narrow-band (e.g., up to
4 kHz) LPCC features work well in a clean environment. How-
ever, in our previous experiments, the linear predictive spectral
envelope shows large spectral distortion in noisy environments
[5], [6]. This results in significant performance degradation.
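For concreteness, the following numpy sketch illustrates the general LPCC recipe rather than the exact configuration of [3], [4]: prediction coefficients are estimated from the frame autocorrelation with the Levinson–Durbin recursion and then converted to cepstra with the standard LPC-to-cepstrum recursion. The frame length, window, and model order below are illustrative choices.

    import numpy as np

    def levinson_durbin(r, order):
        # Solve the normal equations for A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            a[i] = k
            err *= (1.0 - k * k)
        return a

    def lpcc(frame, order=12, n_ceps=12):
        # Autocorrelation of a Hamming-windowed frame.
        w = frame * np.hamming(len(frame))
        r = np.correlate(w, w, mode="full")[len(w) - 1:]
        a = levinson_durbin(r, order)
        # Standard LPC-to-cepstrum recursion for the all-pole model 1/A(z).
        c = np.zeros(n_ceps + 1)
        for n in range(1, n_ceps + 1):
            acc = a[n] if n <= order else 0.0
            for k in range(1, n):
                if n - k <= order:
                    acc += (k / n) * c[k] * a[n - k]
            c[n] = -acc
        return c[1:]

    # Example: cepstra of a synthetic vowel-like frame at 8 kHz.
    sr = 8000
    t = np.arange(int(0.03 * sr)) / sr
    frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 740 * t)
    print(lpcc(frame)[:4])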
For the second approach, there are two groups of features,
based on either Fourier transforms (FTs) or auditory-based
transforms. Representative of the first group are the Mel-frequency
cepstral coefficients (MFCCs), where a fast Fourier
transform (FFT) is applied to generate the spectrum on a linear
scale, and then a bank of band-pass filters is placed along a Mel
frequency scale on top of the FFT output [7]. Alternatively, the
FFT output is warped to a Mel or Bark scale and then a bank
of band-pass filters is placed linearly on top of the warped FFT
output [5], [6]. The proposed algorithm in this paper belongs
to the second group, where the auditory-based transform is
defined as an invertible time–frequency transform. The output
of this transform can follow any frequency scale (e.g., linear,
Bark, or ERB). Therefore, there is no need
to place the band-pass filters on a Mel scale as in the MFCC or
warp the frequency distributions as in [5], [6].
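As a point of reference for the MFCC path summarized above, the following minimal numpy/scipy sketch places triangular band-pass filters on a Mel scale over an FFT power spectrum and applies a DCT to the log filter-bank energies. The Mel mapping, filter count, and frame length are common defaults and are not necessarily the settings of [7].

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, sr, f_lo=0.0, f_hi=None):
        # Triangular filters with edges equally spaced on the Mel scale.
        f_hi = f_hi or sr / 2.0
        mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            for k in range(l, c):
                fb[i - 1, k] = (k - l) / max(c - l, 1)
            for k in range(c, r):
                fb[i - 1, k] = (r - k) / max(r - c, 1)
        return fb

    def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
        # FFT power spectrum of one windowed frame, Mel filter energies, log, DCT.
        spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        fb = mel_filterbank(n_filters, len(frame), sr)
        energies = np.log(fb @ (spec ** 2) + 1e-10)
        return dct(energies, type=2, norm="ortho")[:n_ceps]

    # Example: MFCCs of one 25-ms frame of noise at 16 kHz, for illustration only.
    sr = 16000
    print(mfcc_frame(np.random.randn(400), sr)[:4])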
The MFCC features [7] in the first group are one of the
most popular features for speech and speaker recognition. Like
the LPCC features, the MFCC features perform well in clean
environments but not in adverse environments or mismatched
training and testing conditions. Perceptual linear predictive
(PLP) analysis is another peripheral auditory-based approach.
Based on the FFT output, it uses several perceptually motivated
transforms, including Bark frequency warping, equal-loudness preem-
phasis, and cubic-root amplitude compression [8]. The relative
spectral (RASTA) technique was developed further to filter the time
trajectories of the spectral components and suppress their constant factors
[9]. It is often cascaded with the PLP feature extraction to form
the RASTA-PLP features. Comparisons between MFCC and
RASTA-PLP have been reported in [10]. Further comparisons
with the proposed CFCC features in experiments will be given
at the end of this paper.
Both MFCC and RASTA-PLP features are based on the
Fourier transform (FT). As mentioned above, the FT has a
fixed time–frequency resolution and a well-defined inverse
transform. Fast algorithms exist for both the forward transform
and the inverse transform. Despite its simplicity and efficient
computation, we believe that, when applied to speech
processing, the time–frequency decomposition mechanism of
the FT differs from that of the hearing system.
First, it uses fixed-length windows, which generate pitch har-
monics across the entire speech band. Second, its
frequency bands are distributed linearly, which differs
from the distribution in the human cochlea; further warping
is needed to convert to the Bark, Mel, or other scales. Finally,
in our recent study [1], [2], we observed that the FFT spec-
trogram has more noise distortion and computational noise than
the auditory-based transform we recently developed; one
example is shown in Fig. 4. Thus, we find it necessary
to develop a new feature extraction algorithm based on the new
auditory-based, time–frequency transform [2] to replace the FT
in speech feature extraction.
B. Auditory-Based Time–Frequency Analysis
The traveling wave of the basilar membrane (BM) in the
cochlea and its impulse response have been observed and
reported in the literature, such as [11]–[17]. Moreover, the
BM tuning and auditory filters have also been studied in the
literature [18]–[23]. Many electronic and mathematical models
have been defined to simulate the traveling wave, the auditory
filters, and the frequency responses of the BM [24]–[34].
Based on the study of the human hearing system, Li pro-
posed an auditory-based, time–frequency transform (AT) in [1],
[2]. The new transform comprises a forward
transform and an inverse transform. Through the forward trans-
form, the speech signal is decomposed into a number of
frequency bands using a bank of cochlear filters. The frequency
distribution of the cochlear filters is similar to that in the
cochlea, and the impulse response of the filters is similar to that
of the traveling wave. Through the inverse transform, the orig-
inal speech signal can be reconstructed from the decomposed
bandpass signals. In [2], Li presented the proof of the inverse
transform of the AT and validated the inverse AT in experiments.
Compared to the FFT, the AT has flexible time–frequency res-
olution and its frequency distribution can take on any linear or
nonlinear form. Therefore, it is easy to define the distribution to
be similar to the Bark, Mel, or ERB scale, which approximates
the frequency distribution of the basilar membrane.
Most importantly, the proposed transform has significant
advantages in noise robustness and can be free of pitch-har-
monic distortion, as shown in [2] and Fig. 4. As a result, the AT
provides a new platform for feature extraction research. It forms
the foundation for our robust feature extraction algorithm.

Fig. 1. Schematic diagram of the proposed auditory-based feature extraction
algorithm named CFCCs.
In summary, the ultimate goal of this study is to develop
a practical, front-end speech feature extraction algorithm that
conceptually emulates the human peripheral hearing system and
uses this concept to achieve improved noise robustness
under mismatched training and testing conditions.
The remainder of this paper is organized as follows: Section II
describes the proposed auditory feature extraction algo-
rithm and provides an analytic study and discussion; Section III
studies the feature parameters using a development dataset
and presents experimental results of the proposed CFCC features
in comparison to other front-end features on a test dataset;
finally, Section IV concludes the paper.
II. PROPOSED AUDITORY-BASED FEATURE
EXTRACTION ALGORITHM
This section describes the structure of the proposed audi-
tory-based feature extraction algorithm and provides details of
its computation. Although we would like to emulate the human
peripheral hearing system, the computational aspects must meet
the requirements of real-time applications; therefore, we will
simulate only the most important features of the human periph-
eral hearing system.
An illustrative block diagram of the proposed algorithm is
shown in Fig. 1. The proposed algorithm is intended to con-
ceptually replicate the hearing system at a high level and con-
sists of the following modules: auditory transform implemented
by a cochlear filter bank, hair-cell function with windowing,
cubic-root nonlinearity, and discrete cosine transform (DCT).
A detailed description of each module follows.
A. Auditory Transform
The auditory transform in Fig. 1 is the forward transform of
a pair of invertible auditory-based transforms, as defined and
described by Li in [2]. It can be implemented as a filter bank.
As the foundation of the auditory-based feature extraction al-
gorithm, we use the forward auditory transform to replace the
fast Fourier transform used in many other features. The auditory
transform models the traveling wave in the cochlea where the
sound waveform is decomposed into a set of subband signals.
Let f(t) be a speech signal. A transform T(a, b) of f(t) with re-
spect to a cochlear filter ψ(t), representing the basilar membrane
(BM) impulse response in the cochlea, is defined as

T(a, b) = f(t) \ast \psi_{a,b}(t)    (1)
where ∗ denotes the convolution operation, a and b are real, both
f(t) and ψ(t) belong to L²(R), and T(a, b), representing the
traveling waves in the BM, is the decomposed signal and filter
output. The above equation can also be written as

T(a, b) = \int_{-\infty}^{\infty} f(\tau)\, \psi_{a,b}(t - \tau)\, d\tau    (2)

where

\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\!\left(\frac{t - b}{a}\right)    (3)

Like in the wavelet transform, the factor a is a scale or dilation
variable. By changing a, we can shift the central frequency of
ψ_{a,b}(t) to receive a band of decomposed signals. Factor b is a time shift
or translation variable. For a given value of a, factor b shifts the
function ψ_{a,0}(t) by an amount b along the time axis.
Note that 1/\sqrt{|a|} is an energy normalizing factor. It ensures
that the energy stays the same for all a and b; therefore, we have

\int_{-\infty}^{\infty} |\psi_{a,b}(t)|^{2}\, dt = \int_{-\infty}^{\infty} |\psi(t)|^{2}\, dt    (4)
The cochlear filter, as the most important part of the transform,
is defined as

\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}} \left(\frac{t-b}{a}\right)^{\alpha} \exp\!\left[-2\pi f_L \beta \left(\frac{t-b}{a}\right)\right] \cos\!\left[2\pi f_L \left(\frac{t-b}{a}\right) + \theta\right] u\!\left(\frac{t-b}{a}\right)    (5)

where α > 0 and β > 0, and u(t) is the unit step function; i.e.,
u(t) = 1 for t ≥ 0 and 0 otherwise. Parameters α and β deter-
mine the shape and width of the cochlear filter in the frequency
domain. They can be empirically optimized as shown in our ex-
periments in Section III. The value of θ should be selected such
that (6) is satisfied as follows:

\int_{-\infty}^{\infty} \psi_{a,b}(t)\, dt = 0    (6)

This is required by the transform theory to ensure no informa-
tion is lost during the transform [2]. The value of a can be deter-
mined by the current filter central frequency f_c and the lowest
central frequency f_L in the cochlear filter bank

a = \frac{f_L}{f_c}    (7)
Since we contract ψ(t), defined at the lowest frequency f_L, along the
time axis, the value of a is in the range 0 < a ≤ 1. If we stretched
ψ(t) instead, the value of a would be constrained to a ≥ 1. The frequency
distribution of the cochlear filters can follow linear
or nonlinear scales such as ERB (equivalent rectangular band-
width) [26], Bark [35], Mel [7], or logarithmic scales. For a particular
band number i, the corresponding value of a is denoted
a_i, which needs to be pre-calculated for the required central fre-
quency of the cochlear filter at band number i.
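To make the construction of the filter bank concrete, the following numpy sketch turns (5) and (7) into discrete-time impulse responses and convolves them with the input signal. The log-spaced center frequencies, the truncation length of each impulse response, and the default values of α, β, and θ are illustrative assumptions only; Section III discusses how such parameters are actually selected.

    import numpy as np
    from scipy.signal import fftconvolve

    def cochlear_filter(f_c, f_low, sr, alpha=3.0, beta=0.2, theta=0.0, dur=0.05):
        # One discrete-time cochlear filter impulse response in the spirit of (5).
        # alpha, beta, theta, and the truncation length dur are illustrative values,
        # not the parameters tuned in the paper's experiments.
        a = f_low / f_c                        # dilation factor, cf. (7)
        t = np.arange(int(dur * sr)) / sr      # u(.) handled by keeping t >= 0 only
        x = t / a
        env = (x ** alpha) * np.exp(-2.0 * np.pi * f_low * beta * x)
        psi = env * np.cos(2.0 * np.pi * f_low * x + theta) / np.sqrt(a)
        psi -= psi.mean()                      # crude adjustment toward the zero-mean condition (6)
        return psi

    def cochlear_filterbank(signal, sr, center_freqs):
        # Decompose the signal into band outputs T(a_i, b), one row per center frequency;
        # the translation b is realized by the convolution itself.
        f_low = min(center_freqs)
        return np.stack([fftconvolve(signal, cochlear_filter(fc, f_low, sr), mode="same")
                         for fc in center_freqs])

    # Example: a log-spaced bank applied to one second of noise at 16 kHz.
    sr = 16000
    center_freqs = np.geomspace(200.0, 3400.0, 24)
    bands = cochlear_filterbank(np.random.randn(sr), sr, center_freqs)
    print(bands.shape)                          # (24, 16000)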
Fig. 2 shows the impulse responses of five cochlear filters
plotted using (5), which are similar to the psychoacoustic exper-
iment results, such as the impulse responses plotted in [11], [13].

Fig. 2. Impulse responses of the BM in the auditory transform (AT) for fixed
values of α and β, plotted by (5). The labels on the far left of each subplot represent
the central frequency of the plotted impulse response. They are very similar to
physiological measurements, such as the figures in [11], [12], [36, Fig. 1.12],
[13], etc.

Fig. 3. Frequency responses of the cochlear filters for a fixed α and two different
values of β. The filter bandwidth can be adjusted by β for different
applications.
Fig. 3 shows the corresponding frequency responses. Normally,
we use a fixed value of α, while the value of β controls the filter bandwidth,
i.e., the Q-factor. This makes our auditory transform (AT) dif-
ferent from the Gammatone function [37], in which the Q-factor
is fixed.
We note that the inverse transform of the above transform ex-
ists. It has been proven mathematically and validated experi-
mentally [2]. This property ensures that the forward transform
implemented by the cochlear filter bank can avoid any informa-
tion loss and thus qualifies as a platform for feature extraction.
B. Other Operations
The cochlear filter bank is intended to emulate the impulse
response in the cochlea. However, there are other operations
in the ear. The inner hair cells act as transducers, converting the mechan-
ical movements of the BM into neural activity. When the BM
moves up and down, a shearing motion is created between the
BM and the tectorial membrane [36]. This motion displaces
the hairs atop the hair cells, which generates the neural signals.
However, the hair cells only generate neural signals in one
direction of the BM movement. When the BM moves in the op-
posite direction, there is neither excitation nor neural output.
We studied different implementations of the hair cell function.
The following function of the hair cell output provides the best
performance in our evaluated task:
h(a, b) = T(a, b)^{2}, \qquad \forall\, a, b    (8)

where T(a, b) is the filter-bank output from (1). Here, we as-
sume that all other detailed functions in the outer ear, middle
ear, and the control of the neural system to the cochlea have been
ignored or have been included in the auditory filter responses.
In the next step, the hair cell output of each band is converted
into a representation of nerve spike count density. The duration
of the count can be associated with the current band's central fre-
quency. We use the following equation to mimic this concept:

S(i, j) = \frac{1}{\ell} \sum_{b = jd}^{jd + \ell - 1} h(i, b), \qquad j = 0, 1, 2, \ldots    (9)

where ℓ is the window length, chosen in proportion to τ_i and
lower-bounded by a fixed duration in ms; τ_i is the
period of the central frequency of the ith band; and d
is the window shift duration in ms. We empirically set the system pa-
rameters, but they may need to be adjusted for different datasets.
Instead of using a fixed-length window as in the FFT, we use
a variable-length window for different frequency bands.
The higher the frequency, the shorter the window. This prevents
the high-frequency information from being smoothed out by a
long window duration. The output of the above equation and the
spectrogram of the cochlear filter bank can be used for both fea-
ture extraction and analysis.
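The hair-cell and windowing stages can be sketched in discrete time as follows. The squaring nonlinearity and the rule tying each band's window length to its center-frequency period reflect one possible reading of (8) and (9); the specific constants in the sketch are illustrative guesses rather than the values used in our experiments.

    import numpy as np

    def hair_cell_windowing(bands, center_freqs, sr, cycles=3.5, min_win_s=0.02, hop_s=0.01):
        # bands: (n_bands, n_samples) cochlear filter outputs T(a_i, b).
        # cycles, min_win_s, and hop_s are illustrative constants.
        n_bands, n_samples = bands.shape
        hop = int(hop_s * sr)
        n_frames = 1 + (n_samples - 1) // hop
        h = bands ** 2                                   # hair-cell output, cf. (8)
        S = np.zeros((n_bands, n_frames))
        for i, fc in enumerate(center_freqs):
            # Longer windows for low bands, shorter for high bands.
            win = max(int(cycles * sr / fc), int(min_win_s * sr))
            for j in range(n_frames):
                seg = h[i, j * hop : j * hop + win]
                if seg.size:
                    S[i, j] = seg.mean()                 # average density per window, cf. (9)
        return S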
Furthermore, we apply the scale of loudness function sug-
gested by Stevens [38], [39] to the hair cell output as

y(i, j) = S(i, j)^{1/3}    (10)
In the last step, the discrete cosine transform (DCT) is applied to
decorrelate the feature dimensions and to generate the CFCCs,
so the features can work with the existing back-end.
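The last two modules then amount to a cube-root compression of the spike-count densities followed by a DCT across bands, as in the following sketch; the number of retained coefficients is an illustrative choice. Chained after the previous sketches, this yields an n_ceps-by-n_frames CFCC matrix per utterance.

    import numpy as np
    from scipy.fftpack import dct

    def cfcc_from_densities(S, n_ceps=13):
        # S: (n_bands, n_frames) spike-count densities from the windowing stage.
        y = np.cbrt(S)                                   # cube-root loudness scaling, cf. (10)
        return dct(y, type=2, axis=0, norm="ortho")[:n_ceps, :]

    # Example: 24 bands, 100 frames of synthetic densities.
    print(cfcc_from_densities(np.random.rand(24, 100)).shape)   # (13, 100)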
C. Analysis and Comparison
We provide a comparative analysis of the auditory transform
(AT) and the well-known Fourier transform (FT), and then ex-
tend the comparison to the features derived from the AT, such
as the CFCCs, and from the FT, such as the MFCCs. The anal-
ysis and discussion are intended to help the reader understand
the CFCCs. Further comparisons will be made in Section III.
1) Comparison Between AT and FT: The FFT is the major
tool for time–frequency analysis in speech signal pro-
cessing. We use Fig. 4 to illustrate the differences between the
spectrograms generated from the Fourier transform and our au-
ditory transform [2]. The original speech wave file is recorded
from a male voice. We then calculated the FFT spectrogram
shown in Fig. 4(a), using a 30-ms Hamming window shifted
every 10 ms. To facilitate the comparison, we then warped the
frequency distribution from the linear scale to the Bark scale using
the method in [6].

Fig. 4. Comparison of FT and AT spectra: (a) the FFT spectrogram of
a male voice, “2 0 5,” warped into the Bark scale (0 to
3500 Hz); (b) the spectrogram from the cochlear filter output for the same
male voice. The proposed AT is harmonic free and has less noise.
The spectrogram of our auditory transform is shown in
Fig. 4(b). It was generated from the output of the cochlear filter
bank as defined in (5) and uses a window of fixed duration to
compute the average densities for each band. Comparing the
two spectrograms in Fig. 4, we can observe that there are no
pitch harmonics and there is less computational noise in the
spectrum generated from the auditory transform. In addition,
all formant information has been kept. This is due to the vari-
able length of the cochlear filters and the selection of the filter parameters
in (5). The harmonics in the FFT spectrogram are due to the fixed
window length used for all frequency bands.
Furthermore, we compared the spectra shown in Fig. 5. A
male voice was recorded in a moving car using two different
microphones. A close-talking microphone was placed on the
speaker's lapel, and a hands-free microphone was placed on the
car visor. Fig. 5 shows one of the spectra from Fig. 4, at the 1.15-s
time frame. The solid line represents speech recorded by the
close-talking microphone, and the dashed line corresponds to speech
recorded by the hands-free microphone. Fig. 5 (to