IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 6, AUGUST 2011 1791
An Auditory-Based Feature Extraction Algorithm
for Robust Speaker Identification Under
Mismatched Conditions
Qi Li, Senior Member, IEEE, and Yan Huang, Member, IEEE
Abstract—An auditory-based feature extraction algorithm is
presented. We name the new features cochlear filter cepstral
coefficients (CFCCs); they are defined based on a recently de-
veloped auditory transform (AT) plus a set of modules that emulate
the signal processing functions in the cochlea. The CFCC features
are applied to a speaker identification task to address the acoustic
mismatch problem between training and testing environments.
Usually, the performance of acoustic models trained in clean
speech drops significantly when tested in noisy speech. The CFCC
features have shown strong robustness in this kind of situation. In
our experiments, the CFCC features consistently perform better
than the baseline MFCC features under all three mismatched
testing conditions—white noise, car noise, and babble noise. For
example, in clean conditions, both MFCC and CFCC features
perform similarly, over 96%, but when the signal-to-noise ratio
(SNR) of the input signal is 6 dB, the accuracy of the MFCC
features drops to 41.2%, while the CFCC features still achieve an
accuracy of 88.3%. The proposed CFCC features also compare
favorably to perceptual linear predictive (PLP) and RASTA-PLP
features. The CFCC features consistently perform much better
than PLP. Under white noise, the CFCC features are significantly
better than RASTA-PLP, while under car and babble noise, the
CFCC features provide similar performances to RASTA-PLP.
Index Terms—Auditory-based features, automatic speaker
recognition (ASR), cochlea, feature extraction algorithm, robust
speaker recognition, speaker identification.
I. INTRODUCTION
FEATURE extraction is the first crucial component in automatic speech processing. Generally speaking, successful
front-end features should carry enough discriminative informa-
tion for classification or recognition, fit well with the back-end
modeling, and be robust with respect to the changes of acoustic
environments. To the best of our knowledge, obtaining a sat-
isfactory system performance under various operating modes
still remains problematic, especially when acoustic training and
testing environments are mismatched. Since the human hearing
Manuscript received July 29, 2009; revised January 06, 2010; accepted De-
cember 09, 2010. Date of publication December 23, 2010; date of current ver-
sion June 01, 2011. This work was supported by the U.S. AFRL under the Con-
tract FA8750-08-C0028. The associate editor coordinating the review of this
manuscript and approving it for publication was Dr. Richard C. Rose.
Q. Li is with Li Creative Technologies, Inc., Florham Park, NJ 07932 USA
(e-mail: qili@ieee.org).
Y. Huang was with Li Creative Technologies, Inc., Florham Park, NJ 07932
USA. She is now with Microsoft Corporation, Redmond, WA 98052 USA
(e-mail: yanhuang@ieee.org).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2010.2101594
system is robust to the mismatched conditions, we propose an
auditory-based feature extraction algorithm that is modeled on
the basic signal processing functions in the ear. The proposed al-
gorithm is also based on our recently developed auditory-based
time–frequency transform named Auditory Transform (AT) [1],
[2]. The features generated from the proposed algorithm are
named cochlear filter cepstral coefficients (CFCCs).
A. Traditional Speech Feature Extraction and the Fourier
Analysis
At a high level, most speech feature extraction methods fall
into the following two categories: modeling the human voice
production system or modeling the peripheral auditory system.
For the first approach, one of the most popular features is
a group of cepstral coefficients derived from linear prediction
known as the linear prediction cepstral coefficients (LPCCs)
[3], [4]. LPCC feature extraction utilizes an all-pole filter
to model the human vocal tract, with the speech formants captured
by the poles of the filter. The narrow-band (e.g., up to
4 kHz) LPCC features work well in a clean environment. How-
ever, in our previous experiments, the linear predictive spectral
envelope shows large spectral distortion in noisy environments
[5], [6]. This results in significant performance degradation.
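For concreteness, the following numpy sketch illustrates the general LPCC recipe rather than the exact configuration of [3], [4]: prediction coefficients are estimated from the frame autocorrelation with the Levinson–Durbin recursion and then converted to cepstra with the standard LPC-to-cepstrum recursion. The frame length, window, and model order below are illustrative choices.

    import numpy as np

    def levinson_durbin(r, order):
        # Solve the normal equations for A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]
            a[i] = k
            err *= (1.0 - k * k)
        return a

    def lpcc(frame, order=12, n_ceps=12):
        # Autocorrelation of a Hamming-windowed frame.
        w = frame * np.hamming(len(frame))
        r = np.correlate(w, w, mode="full")[len(w) - 1:]
        a = levinson_durbin(r, order)
        # Standard LPC-to-cepstrum recursion for the all-pole model 1/A(z).
        c = np.zeros(n_ceps + 1)
        for n in range(1, n_ceps + 1):
            acc = a[n] if n <= order else 0.0
            for k in range(1, n):
                if n - k <= order:
                    acc += (k / n) * c[k] * a[n - k]
            c[n] = -acc
        return c[1:]

    # Example: cepstra of a synthetic vowel-like frame at 8 kHz.
    sr = 8000
    t = np.arange(int(0.03 * sr)) / sr
    frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 740 * t)
    print(lpcc(frame)[:4])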
For the second approach, there are two groups of features,
based on either Fourier transforms (FTs) or auditory-based
transforms. Representative of the first group are the Mel-frequency
cepstral coefficients (MFCCs), where a fast Fourier
transform (FFT) is applied to generate the spectrum on a linear
scale, and then a bank of band-pass filters is placed along a Mel
frequency scale on top of the FFT output [7]. Alternatively, the
FFT output is warped to a Mel or Bark scale and then a bank
of band-pass filters is placed linearly on top of the warped FFT
output [5], [6]. The proposed algorithm in this paper belongs
to the second group, where the auditory-based transform is
defined as an invertible time–frequency transform. The output
of this transform can follow any frequency scale (e.g., linear,
Bark, or ERB). Therefore, there is no need
to place the band-pass filters on a Mel scale as in the MFCC or
warp the frequency distributions as in [5], [6].
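As a point of reference for the MFCC path summarized above, the following minimal numpy/scipy sketch places triangular band-pass filters on a Mel scale over an FFT power spectrum and applies a DCT to the log filter-bank energies. The Mel mapping, filter count, and frame length are common defaults and are not necessarily the settings of [7].

    import numpy as np
    from scipy.fftpack import dct

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, sr, f_lo=0.0, f_hi=None):
        # Triangular filters with edges equally spaced on the Mel scale.
        f_hi = f_hi or sr / 2.0
        mel_pts = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            for k in range(l, c):
                fb[i - 1, k] = (k - l) / max(c - l, 1)
            for k in range(c, r):
                fb[i - 1, k] = (r - k) / max(r - c, 1)
        return fb

    def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
        # FFT power spectrum of one windowed frame, Mel filter energies, log, DCT.
        spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        fb = mel_filterbank(n_filters, len(frame), sr)
        energies = np.log(fb @ (spec ** 2) + 1e-10)
        return dct(energies, type=2, norm="ortho")[:n_ceps]

    # Example: MFCCs of one 25-ms frame of noise at 16 kHz, for illustration only.
    sr = 16000
    print(mfcc_frame(np.random.randn(400), sr)[:4])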
The MFCC features [7] in the first group are one of the
most popular features for speech and speaker recognition. Like
the LPCC features, the MFCC features perform well in clean
environments but not in adverse environments or mismatched
training and testing conditions. Perceptual linear predictive
(PLP) analysis is another peripheral auditory-based approach.
Based on the FFT output, it uses several perceptually motivated
transforms, including Bark frequency warping, equal-loudness preem-
phasis, and cubic-root amplitude compression [8]. The relative
spectral (RASTA) technique was developed further to filter the time
trajectories of the spectral components and suppress their constant factors
[9]. It is often cascaded with the PLP feature extraction to form
the RASTA-PLP features. Comparisons between MFCC and
RASTA-PLP have been reported in [10]. Further comparisons
with the proposed CFCC features in experiments will be given
at the end of this paper.
Both MFCC and RASTA-PLP features are based on the
Fourier transform (FT). As mentioned above, the FT has a
fixed time–frequency resolution and a well-defined inverse
transform. Fast algorithms exist for both the forward transform
and the inverse transform. Despite its simplicity and efficient
computation, we believe that, when applied to speech
processing, the time–frequency decomposition mechanism of
the FT differs from that of the hearing system.
First, it uses fixed-length windows, which generate pitch har-
monics across the entire speech band. Second, its
frequency bands are distributed linearly, which differs
from the distribution in the human cochlea; further warping
is needed to convert to the Bark, Mel, or other scales. Finally,
in our recent study [1], [2], we observed that the FFT spec-
trogram has more noise distortion and computational noise than
the auditory-based transform we recently developed; one
example is shown in Fig. 4. Thus, we find it necessary
to develop a new feature extraction algorithm based on the new
auditory-based, time–frequency transform [2] to replace the FT
in speech feature extraction.
B. Auditory-Based Time–Frequency Analysis
The traveling wave of the basilar membrane (BM) in the
cochlea and its impulse response have been observed and
reported in the literature, such as [11]–[17]. Moreover, the
BM tuning and auditory filters have also been studied in the
literature [18]–[23]. Many electronic and mathematical models
have been defined to simulate the traveling wave, the auditory
filters, and the frequency responses of the BM [24]–[34].
Based on the study of the human hearing system, Li pro-
posed an auditory-based, time–frequency transform (AT) in [1],
[2]. The new transform comprises a forward
transform and an inverse transform. Through the forward trans-
form, the speech signal is decomposed into a number of
frequency bands using a bank of cochlear filters. The frequency
distribution of the cochlear filters is similar to that in the
cochlea, and the impulse response of the filters is similar to that
of the traveling wave. Through the inverse transform, the orig-
inal speech signal can be reconstructed from the decomposed
bandpass signals. In [2], Li presented the proof of the inverse
transform of the AT and validated the inverse AT in experiments.
Compared to the FFT, the AT has flexible time–frequency res-
olution and its frequency distribution can take on any linear or
nonlinear form. Therefore, it is easy to define the distribution to
be similar to the Bark, Mel, or ERB scale, which approximates
the frequency distribution of the basilar membrane.
Most importantly, the proposed transform has significant
advantages in noise robustness and can be free of pitch-har-
monic distortion, as shown in [2] and Fig. 4. As a result, the AT
provides a new platform for feature extraction research. It forms
the foundation for our robust feature extraction algorithm.

Fig. 1. Schematic diagram of the proposed auditory-based feature extraction
algorithm named CFCCs.
In summary, the ultimate goal of this study is to develop
a practical, front-end speech feature extraction algorithm that
conceptually emulates the human peripheral hearing system and
uses this concept to achieve improved noise robustness
under mismatched training and testing conditions.
The remainder of this paper is organized as follows: Section II
describes the proposed auditory feature extraction algo-
rithm and provides an analytic study and discussion; Section III
studies the feature parameters using a development dataset
and presents experimental results of the proposed CFCC features
in comparison to other front-end features on a test dataset;
finally, Section IV concludes the paper.
II. PROPOSED AUDITORY-BASED FEATURE
EXTRACTION ALGORITHM
This section describes the structure of the proposed audi-
tory-based feature extraction algorithm and provides details of
its computation. Although we would like to emulate the human
peripheral hearing system, the computational aspects must meet
the requirements of real-time applications; therefore, we will
simulate only the most important features of the human periph-
eral hearing system.
An illustrative block diagram of the proposed algorithm is
shown in Fig. 1. The proposed algorithm is intended to con-
ceptually replicate the hearing system at a high level and con-
sists of the following modules: auditory transform implemented
by a cochlear filter bank, hair-cell function with windowing,
cubic-root nonlinearity, and discrete cosine transform (DCT).
A detailed description of each module follows.
A. Auditory Transform
The auditory transform in Fig. 1 is the forward transform of
a pair of invertible auditory-based transforms, as defined and
described by Li in [2]. It can be implemented as a filter bank.
As the foundation of the auditory-based feature extraction al-
gorithm, we use the forward auditory transform to replace the
fast Fourier transform used in many other features. The auditory
transform models the traveling wave in the cochlea where the
sound waveform is decomposed into a set of subband signals.
Let f(t) be a speech signal. A transform T(a, b) of f(t) with re-
spect to a cochlear filter ψ(t), representing the basilar membrane
(BM) impulse response in the cochlea, is defined as

T(a, b) = f(t) \ast \psi_{a,b}(t)    (1)
where ∗ denotes the convolution operation, a and b are real, both
f(t) and ψ(t) belong to L²(R), and T(a, b), representing the
traveling waves in the BM, is the decomposed signal and filter
output. The above equation can also be written as

T(a, b) = \int_{-\infty}^{\infty} f(\tau)\, \psi_{a,b}(t - \tau)\, d\tau    (2)

where

\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\!\left(\frac{t - b}{a}\right)    (3)

Like in the wavelet transform, the factor a is a scale or dilation
variable. By changing a, we can shift the central frequency of
ψ_{a,b}(t) to receive a band of decomposed signals. Factor b is a time shift
or translation variable. For a given value of a, factor b shifts the
function ψ_{a,0}(t) by an amount b along the time axis.
Note that 1/\sqrt{|a|} is an energy normalizing factor. It ensures
that the energy stays the same for all a and b; therefore, we have

\int_{-\infty}^{\infty} |\psi_{a,b}(t)|^{2}\, dt = \int_{-\infty}^{\infty} |\psi(t)|^{2}\, dt    (4)
The cochlear filter, as the most important part of the transform,
is defined as

\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}} \left(\frac{t-b}{a}\right)^{\alpha} \exp\!\left[-2\pi f_L \beta \left(\frac{t-b}{a}\right)\right] \cos\!\left[2\pi f_L \left(\frac{t-b}{a}\right) + \theta\right] u\!\left(\frac{t-b}{a}\right)    (5)

where α > 0 and β > 0, and u(t) is the unit step function; i.e.,
u(t) = 1 for t ≥ 0 and 0 otherwise. Parameters α and β deter-
mine the shape and width of the cochlear filter in the frequency
domain. They can be empirically optimized as shown in our ex-
periments in Section III. The value of θ should be selected such
that (6) is satisfied as follows:

\int_{-\infty}^{\infty} \psi_{a,b}(t)\, dt = 0    (6)

This is required by the transform theory to ensure no informa-
tion is lost during the transform [2]. The value of a can be deter-
mined by the current filter central frequency f_c and the lowest
central frequency f_L in the cochlear filter bank

a = \frac{f_L}{f_c}    (7)
Since we contract ψ(t), defined at the lowest frequency f_L, along the
time axis, the value of a is in the range 0 < a ≤ 1. If we stretched
ψ(t) instead, the value of a would be constrained to a ≥ 1. The frequency
distribution of the cochlear filters can follow linear
or nonlinear scales such as ERB (equivalent rectangular band-
width) [26], Bark [35], Mel [7], or logarithmic scales. For a particular
band number i, the corresponding value of a is denoted
a_i, which needs to be pre-calculated for the required central fre-
quency of the cochlear filter at band number i.
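To make the construction of the filter bank concrete, the following numpy sketch turns (5) and (7) into discrete-time impulse responses and convolves them with the input signal. The log-spaced center frequencies, the truncation length of each impulse response, and the default values of α, β, and θ are illustrative assumptions only; Section III discusses how such parameters are actually selected.

    import numpy as np
    from scipy.signal import fftconvolve

    def cochlear_filter(f_c, f_low, sr, alpha=3.0, beta=0.2, theta=0.0, dur=0.05):
        # One discrete-time cochlear filter impulse response in the spirit of (5).
        # alpha, beta, theta, and the truncation length dur are illustrative values,
        # not the parameters tuned in the paper's experiments.
        a = f_low / f_c                        # dilation factor, cf. (7)
        t = np.arange(int(dur * sr)) / sr      # u(.) handled by keeping t >= 0 only
        x = t / a
        env = (x ** alpha) * np.exp(-2.0 * np.pi * f_low * beta * x)
        psi = env * np.cos(2.0 * np.pi * f_low * x + theta) / np.sqrt(a)
        psi -= psi.mean()                      # crude adjustment toward the zero-mean condition (6)
        return psi

    def cochlear_filterbank(signal, sr, center_freqs):
        # Decompose the signal into band outputs T(a_i, b), one row per center frequency;
        # the translation b is realized by the convolution itself.
        f_low = min(center_freqs)
        return np.stack([fftconvolve(signal, cochlear_filter(fc, f_low, sr), mode="same")
                         for fc in center_freqs])

    # Example: a log-spaced bank applied to one second of noise at 16 kHz.
    sr = 16000
    center_freqs = np.geomspace(200.0, 3400.0, 24)
    bands = cochlear_filterbank(np.random.randn(sr), sr, center_freqs)
    print(bands.shape)                          # (24, 16000)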
Fig. 2 shows the impulse responses of five cochlear filters
plotted using (5), which are similar to the psychoacoustic exper-
iment results, such as the impulse responses plotted in [11], [13].

Fig. 2. Impulse responses of the BM in the auditory transform (AT) for fixed
values of α and β, plotted by (5). The labels on the far left of each subplot represent
the central frequency of the plotted impulse response. They are very similar to
physiological measurements, such as the figures in [11], [12], [36, Fig. 1.12],
[13], etc.

Fig. 3. Frequency responses of the cochlear filters for a fixed α and two different
values of β. The filter bandwidth can be adjusted by β for different
applications.
Fig. 3 shows the corresponding frequency responses. Normally,
we use a fixed value of α, while the value of β controls the filter bandwidth,
i.e., the Q-factor. This makes our auditory transform (AT) dif-
ferent from the Gammatone function [37], in which the Q-factor
is fixed.
We note that the inverse transform of the above transform ex-
ists. It has been proven mathematically and validated experi-
mentally [2]. This property ensures that the forward transform
implemented by the cochlear filter bank can avoid any informa-
tion loss and thus qualifies as a platform for feature extraction.
B. Other Operations
The cochlear filter bank is intended to emulate the impulse
response in the cochlea. However, there are other operations
in the ear. The inner hair cells act as transducers, converting the mechan-
ical movements of the BM into neural activity. When the BM
moves up and down, a shearing motion is created between the
BM and the tectorial membrane [36]. This motion displaces
the hairs atop the hair cells, which generates the neural signals.
However, the hair cells only generate neural signals in one
direction of the BM movement. When the BM moves in the op-
posite direction, there is neither excitation nor neural output.
We studied different implementations of the hair cell function.
The following function of the hair cell output provides the best
performance in our evaluated task:
h(a, b) = T(a, b)^{2}, \qquad \forall\, a, b    (8)

where T(a, b) is the filter-bank output from (1). Here, we as-
sume that all other detailed functions in the outer ear, middle
ear, and the control of the neural system to the cochlea have been
ignored or have been included in the auditory filter responses.
In the next step, the hair cell output of each band is converted
into a representation of nerve spike count density. The duration
of the count can be associated with the current band's central fre-
quency. We use the following equation to mimic this concept:

S(i, j) = \frac{1}{\ell} \sum_{b = jd}^{jd + \ell - 1} h(i, b), \qquad j = 0, 1, 2, \ldots    (9)

where ℓ is the window length, chosen in proportion to τ_i and
lower-bounded by a fixed duration in ms; τ_i is the
period of the central frequency of the ith band; and d
is the window shift duration in ms. We empirically set the system pa-
rameters, but they may need to be adjusted for different datasets.
Instead of using a fixed-length window as in the FFT, we use
a variable-length window for different frequency bands.
The higher the frequency, the shorter the window. This prevents
the high-frequency information from being smoothed out by a
long window duration. The output of the above equation and the
spectrogram of the cochlear filter bank can be used for both fea-
ture extraction and analysis.
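The hair-cell and windowing stages can be sketched in discrete time as follows. The squaring nonlinearity and the rule tying each band's window length to its center-frequency period reflect one possible reading of (8) and (9); the specific constants in the sketch are illustrative guesses rather than the values used in our experiments.

    import numpy as np

    def hair_cell_windowing(bands, center_freqs, sr, cycles=3.5, min_win_s=0.02, hop_s=0.01):
        # bands: (n_bands, n_samples) cochlear filter outputs T(a_i, b).
        # cycles, min_win_s, and hop_s are illustrative constants.
        n_bands, n_samples = bands.shape
        hop = int(hop_s * sr)
        n_frames = 1 + (n_samples - 1) // hop
        h = bands ** 2                                   # hair-cell output, cf. (8)
        S = np.zeros((n_bands, n_frames))
        for i, fc in enumerate(center_freqs):
            # Longer windows for low bands, shorter for high bands.
            win = max(int(cycles * sr / fc), int(min_win_s * sr))
            for j in range(n_frames):
                seg = h[i, j * hop : j * hop + win]
                if seg.size:
                    S[i, j] = seg.mean()                 # average density per window, cf. (9)
        return S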
Furthermore, we apply the scale of loudness function sug-
gested by Stevens [38], [39] to the hair cell output as

y(i, j) = S(i, j)^{1/3}    (10)
In the last step, the discrete cosine transform (DCT) is applied to
decorrelate the feature dimensions and to generate the CFCCs,
so the features can work with the existing back-end.
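The last two modules then amount to a cube-root compression of the spike-count densities followed by a DCT across bands, as in the following sketch; the number of retained coefficients is an illustrative choice. Chained after the previous sketches, this yields an n_ceps-by-n_frames CFCC matrix per utterance.

    import numpy as np
    from scipy.fftpack import dct

    def cfcc_from_densities(S, n_ceps=13):
        # S: (n_bands, n_frames) spike-count densities from the windowing stage.
        y = np.cbrt(S)                                   # cube-root loudness scaling, cf. (10)
        return dct(y, type=2, axis=0, norm="ortho")[:n_ceps, :]

    # Example: 24 bands, 100 frames of synthetic densities.
    print(cfcc_from_densities(np.random.rand(24, 100)).shape)   # (13, 100)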
C. Analysis and Comparison
We provide a comparative analysis of the auditory transform
(AT) and the well-known Fourier transform (FT), and then ex-
tend the comparison to the features derived from the AT, such
as the CFCCs, and from the FT, such as the MFCCs. The anal-
ysis and discussion are intended to help the reader understand
the CFCCs. Further comparisons will be made in Section III.
1) Comparison Between AT and FT: The FFT is the major
tool for time–frequency analysis in speech signal pro-
cessing. We use Fig. 4 to illustrate the differences between the
spectrograms generated from the Fourier transform and our au-
ditory transform [2]. The original speech wave file is recorded
from a male voice. We then calculated the FFT spectrogram
shown in Fig. 4(a), using a 30-ms Hamming window shifted
every 10 ms. To facilitate the comparison, we then warped the
frequency distribution from the linear scale to the Bark scale using
the method in [6].

Fig. 4. Comparison of FT and AT spectra: (a) the FFT spectrogram of
a male voice, “2 0 5,” warped into the Bark scale (0 to
3500 Hz); (b) the spectrogram from the cochlear filter output for the same
male voice. The proposed AT is harmonic free and has less noise.
The spectrogram of our auditory transform is shown in
Fig. 4(b). It was generated from the output of the cochlear filter
bank as defined in (5) and uses a window of fixed duration to
compute the average densities for each band. Comparing the
two spectrograms in Fig. 4, we can observe that there are no
pitch harmonics and there is less computational noise in the
spectrum generated from the auditory transform. In addition,
all formant information has been kept. This is due to the vari-
able length of the cochlear filters and the selection of the filter parameters
in (5). The harmonics in the FFT spectrogram are due to the fixed
window length used for all frequency bands.
Furthermore, we compared the spectra shown in Fig. 5. A
male voice was recorded in a moving car using two different
microphones. A close-talking microphone was placed on the
speaker's lapel, and a hands-free microphone was placed on the
car visor. Fig. 5 shows one of the spectra from Fig. 4, at the 1.15-s
time frame. The solid line represents speech recorded by the
close-talking microphone, and the dashed line corresponds to speech
recorded by the hands-free microphone. Fig. 5 (to