
Page 1: [IEEE 2012 IEEE Symposium on Computer Applications and Industrial Electronics (ISCAIE) - Kota Kinabalu, Sabah, Malaysia (2012.12.3-2012.12.4)] 2012 International Symposium on Computer

Classification of Speaker Accent using Hybrid DWT-LPC Features and K-Nearest Neighbors in

Ethnically Diverse Malaysian English

Yusnita MA1,2,a, Paulraj MP2,b, Sazali Yaacob2,c and Shahriman AB2 1Faculty of Electrical Engineering, Universiti Teknologi MARA, 13500 Permatang Pauh, P.Pinang, Malaysia

2School of Mechatronic Engineering, Universiti Malaysia Perlis, 02600 Ulu Pauh, Malaysia [email protected], [email protected], [email protected]

Abstract—Accent is a major cause of variability in automatic speaker-independent speech recognition systems and, under certain circumstances, degrades their performance. An accent analyzer placed in a preceding stage is one way to circumvent this deficiency. This paper proposes a hybrid approach to extracting accent cues from speech utterances using linear predictive coefficients (LPC) derived from discrete wavelet transform (DWT) subbands. The constructed features were used to build an accent recognizer based on K-nearest neighbors. Experimental results show that the hybrid dyadic-X DWT-LPC features correlate strongly with the Malay, Chinese and Indian accents of Malaysian English speakers, raising the classification rate by 9.28% over the conventional LPC method.

Keywords—Discrete Wavelet Transform; Linear Predictive Coding; Accent Classification; K-nearest Neighbors; Malaysian English

I. INTRODUCTION

Because Malaysia is populated by many different ethnic groups, English is spoken there with a variety of accents [1, 2], each inherited from the phoneme inventory of the speakers' mother tongues. This complexity presents a critical issue for state-of-the-art automatic speaker-independent speech recognition (SI-ASR) systems. To the best of our knowledge, no commercial ASR for Malaysian English (MalE) is available that tackles this nonuniform variation among the population. Thus, under certain circumstances, accent variability degrades the performance of such systems.

The past literature offers basically two ways of handling accented speech, namely acoustic model adaptation and pronunciation adaptation. A large volume of published studies [3-6] uses acoustic model adaptation, either with accent-dependent recognizers or a global recognizer. In contrast, some researchers handle accented speech with a pronunciation variation dictionary [7-9]. A local researcher [10] adapted pronunciation modeling through phone adaptation and pronunciation generalization for MalE on the CMU SPHINX ASR. However, disparate pronunciations due to idiosyncrasies and accents force the dictionary to grow, which in turn increases decoding time and introduces more confusions. Incorporating an accent analyzer is more practical and has been proven to greatly improve ASR performance [4, 6].

Numerous studies have extracted accent features in the short-time Fourier transform (STFT) domain, for example filter-bank analysis, mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction [4, 8, 11-13], or with parametric autoregressive (AR) models such as linear predictive coding (LPC) and formant analysis [3, 7, 14, 15]. Others employed temporal features such as pitch contour and energy [7, 16]. The STFT is widely used in accent and speech recognition, but its drawback is a precision fixed by the window size. Even if short-time speech frames of 20 ms to 30 ms are considered quasi-stationary, in reality they contain information from several different phonemes in which a particular accent can be embedded. The shift towards a multi-resolution paradigm for decomposing the salient information has made the discrete wavelet transform (DWT) useful in speech and accent recognition. A mel-scale-like wavelet packet tree structure for phoneme classification was proposed in [17] and outperformed the popular MFCC under linear discriminant analysis. In [18], MFCC were computed from wavelet subbands for speaker identification with hidden Markov models to increase robustness in noisy environments. A comparison of MFCC, wavelet packet and perceptually Bark-scaled wavelet features for robust ASR was reported in [19]. Nehe [20] demonstrated new features based on predictive coefficients derived from DWT and wavelet packet subbands for efficient modeling and improved accuracy in isolated-word ASR on the NIST T1-45 database.

Applications of the DWT to accent classification are scarce in the aforementioned literature. The aim of this paper is therefore to propose a hybrid approach to extracting accent cues from speech utterances using LPC derived from DWT subbands, and to examine the features with a simple distance-based K-nearest neighbors (KNN) classifier. We hypothesize that deeper DWT subband levels require fewer predictive coefficients, in proportion to their reduced frequency precision. In this way the feature size of the hybrid DWT-LPC remains that of the original LPC, while the drawback of uniform weighting across the whole spectrum [21] is overcome by the proposed algorithm in classifying ethnically diverse accents of MalE. This idea of multi-numbered subband coefficients was also

978-1-4673-3033-6/10/$26.00 ©2012 IEEE

2012 International Symposium on Computer Applications and Industrial Electronics (ISCAIE 2012), December 3-4, 2012, Kota Kinabalu, Malaysia



Figure 1. Block diagram of LPC speech analysis: Speech → Normalization → Pre-emphasis → Frame blocking → Windowing → Autocorrelation → LPC coefficients.

previously utilized by [22] to derive their proposed discrete wavelet coefficients calculated from mel-scaled log filterbank energies, unlike [20], in which a uniform number of coefficients was generated from the DWT and wavelet packet subbands.

This paper is organized as follows. Section II briefly describes the experimental setup and speech database. Sections III and IV describe the methodologies used for extracting speech features and for accent modeling. The parameter settings and findings are discussed in Section V. Lastly, Section VI concludes the important findings of this paper.

II. EXPERIMENTAL DATABASE

The experiments were conducted on a local MalE database developed by our research group. For the analysis in this work, we took a speech corpus recorded from 42 female volunteers from the three main ethnic groups: 15 Malay, 15 Chinese and 12 Indian female speakers (we faced a shortage of Indian speakers). The utterances were the three most accent-sensitive words of MalE found in our previous analysis, namely bottom, aluminum and target, and each word was repeated five times by each speaker, giving 630 speech samples in total. The speakers originated from the northern, southern, western and eastern regions of the country and were therefore also influenced by their regional accents. The average duration of each word was about 0.5 s to 0.7 s, and the words were auto-segmented prior to the analysis. Subjects were postgraduate students of Universiti Malaysia Perlis aged from 18 to below 40 years. The recording was carried out in a semi-anechoic acoustic chamber using a handheld condenser, supercardioid, unidirectional microphone. The background noise in the room was approximately 22 dB, which is very quiet and controlled compared to a normal quiet room of about 40 to 50 dB. The speech was recorded through a laptop sound card with a MATLAB program, with the sampling rate set to 16 kHz and the resolution to 16 bits per sample.

III. FEATURE EXTRACTION

In this section we describe the methods used in the front-end processing of this accent classification system.

A. Linear Predictive Coding

LPC analysis models the speech signal as a p-order AR system, a special case of an all-pole IIR filter. This filter is commonly used to model the human vocal tract as an LTI system over short intervals. The underlying principle of this technique is that the output of the system is predicted as a linear combination of past output samples, assuming the input is unknown [23], as expressed in (1). The block diagram of the LPC procedure is summarized in Fig. 1. The LPC coefficients are extracted from the pre-processed speech using (2) through (4). The future speech samples are estimated from a linearly weighted summation of the past p samples using the method of least squares as follows:

x̂(n) = −Σ_{k=1}^{p} a(k) x(n−k).   (1)

where x(.) and x̂(.) are the speech samples and their estimates, a(k) = [a(1), a(2), …, a(p)]^T are the LPC parameters and p is the linear predictive filter order.

The autocorrelation function (ACF) of each frame can be computed [24] as in (2).

r(i) = Σ_{n=0}^{N−1−i} x_F(n) x_F(n+i).   (2)

where r(i) = [r(0), r(1), r(2), …, r(p)]^T is the ACF of a frame x_F(.) consisting of N sample points and i = 0, 1, 2, …, p.

Next, the Yule-Walker equations are solved in (3) using the Levinson-Durbin recursive algorithm to obtain the coefficients of the prediction filter.

Σ_{k=1}^{p} a(k) R(i−k) = −r(i).   (3)

where R(i−k) forms the p-by-p ACF matrix, which is a symmetric Toeplitz matrix, and 1 ≤ i ≤ p. Hence, a(k) can be solved efficiently by multiplying the inverse of the ACF matrix by the ACF vector, as in (4).

a = −R^{−1} r.   (4)
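As an illustration of (2) through (4), the following NumPy sketch forms the frame autocorrelation and solves the Yule-Walker system directly by matrix inversion. This is a minimal reconstruction under the paper's sign convention; a production implementation would use the Levinson-Durbin recursion mentioned above, and the function name is our own.

```python
import numpy as np

def lpc_coefficients(frame, p):
    """Estimate p-order LPC coefficients a(1..p) from one speech frame
    by forming the autocorrelation function of eq. (2) and solving the
    Yule-Walker system of eqs. (3)-(4). Sign convention follows the
    paper: x_hat(n) = -sum_k a(k) x(n-k), hence a = -R^{-1} r."""
    N = len(frame)
    # Autocorrelation r(0..p), eq. (2)
    r = np.array([np.dot(frame[:N - i], frame[i:]) for i in range(p + 1)])
    # Symmetric Toeplitz ACF matrix R(i-k), p-by-p
    R = np.array([[r[abs(i - k)] for k in range(p)] for i in range(p)])
    # Eq. (4): a = -R^{-1} r, with r = [r(1) ... r(p)]^T
    return -np.linalg.solve(R, r[1:])
```

Applied to a synthetic AR(2) signal x(n) = 0.75 x(n−1) − 0.5 x(n−2) + e(n), the estimate approaches a ≈ [−0.75, 0.5] under this sign convention.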

B. Discrete Wavelet Transform

The DWT is an alternative to the STFT that provides multi-resolution analysis, analyzing different frequencies with different precision. The continuous wavelet transform (CWT) is defined as the integral over all time of the original signal multiplied by scaled, shifted versions of the wavelet function, expressed mathematically as in (5).

XWT(τ, c) = (1/√c) ∫ x(t) ψ*((t − τ)/c) dt.   (5)

where x(.) is the signal to be analyzed, ψ(.) is the mother wavelet or basis function, τ is the shifting parameter, which gives the position along the original signal and thus the time information, and c is the scaling parameter, which corresponds to the frequency information. All the wavelet functions used in the transformation are derived from the mother wavelet through translation (shifting) and scaling (dilation or compression) operations. In contrast to the CWT, in the DWT the scales and positions are chosen as powers of two, the well-known dyadic fashion. An efficient way to implement this is through quadrature mirror filters (QMF). The most popularly and



Figure 2. Three-level wavelet decomposition tree in dyadic fashion.

Figure 3. Block diagram of the proposed feature extraction of the hybrid DWT-LPC approach.

Figure 4. The structure of the dyadic-X combination of LPC orders from DWT subbands.

successfully used wavelet family is the Daubechies family, named after its inventor, Ingrid Daubechies; it is a compactly supported orthonormal wavelet family [25].

The sub-band coding implementation with the QMF mentioned above, which performs the multi-resolution analysis, is depicted in Fig. 2. Since in most signals the low-frequency part is the most important, at each level the approximation is decomposed into lower-resolution components. H0 is the high-pass filter producing the detail components d[n], while G0 is the low-pass filter producing the approximation signal a[n] [26]. For more information on the DWT, interested readers may refer to [25-28].
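The QMF filter-then-downsample structure of Fig. 2 can be illustrated with a short NumPy sketch. For brevity it uses the two-tap Haar filters rather than the db3 wavelet employed later in the paper, but the analysis step and the deepest-first [cA3, cD3, cD2, cD1] output are the same.

```python
import numpy as np

# One-level QMF analysis step with the Haar wavelet (the paper uses
# db3; Haar keeps the filter coefficients short for illustration).
LO = np.array([1.0, 1.0]) / np.sqrt(2.0)   # G0: low-pass (approximation)
HI = np.array([1.0, -1.0]) / np.sqrt(2.0)  # H0: high-pass (detail)

def dwt_step(x):
    """Filter with the QMF pair, then downsample by 2: (a[n], d[n])."""
    a = np.convolve(x, LO)[1::2]
    d = np.convolve(x, HI)[1::2]
    return a, d

def dwt3(x):
    """Three-level dyadic decomposition: returns (cA3, cD3, cD2, cD1),
    the successive approximations being re-decomposed at each level."""
    a1, d1 = dwt_step(x)
    a2, d2 = dwt_step(a1)
    a3, d3 = dwt_step(a2)
    return a3, d3, d2, d1

x = np.random.default_rng(1).standard_normal(512)
cA3, cD3, cD2, cD1 = dwt3(x)
```

Because the Haar pair is orthonormal, the subband lengths halve at each level (256, 128, 64 for a 512-point frame) and the total energy of the four subbands equals that of the input frame.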

C. The Complete Proposed System

The proposed feature extraction process is depicted in Fig. 3 and described below. After conversion from analog to digital by the ADC block, the speech data was first zero-adjusted to remove any DC bias from the recording. The raw data was then normalized using mean-variance normalization to compensate for loudness inconsistency among the speakers. A pre-emphasis filter, the first-order high-pass FIR filter H(z) = 1 − 0.9375 z^{−1}, was applied to compensate for the attenuation in spectral energy of approximately 6 dB per octave. Next, the speech was frame-blocked into 512-point frames with a 256-point step between consecutive frames. The last pre-processing step was to apply a Hamming window to each speech frame, with the function defined in (6).

w(n) = 0.54 − 0.46 cos(2πn/(N−1)).   (6)

where n is the sample index in the range 0 ≤ n ≤ N−1 and N is the frame or window size.

The frame signals were then decomposed into a three-level subband decomposition tree using the DWT. The wavelet function used in this experiment was the third-order Daubechies wavelet. Each frame was transformed into DWT coefficients, and the approximation at level 3 (cA3) and the details at levels 3, 2 and 1 (cD3, cD2, cD1) were taken as the subbands for further extraction. In the final stage, LPC coefficients were determined from each of these subbands, and all subbands' LPC coefficients were compiled into the input feature vector. Since each decomposition level halves the frequency content of its subbands, for dyadic-X DWT-LPC the number of predictive coefficients was divided by the scale. The dyadic way of decomposing is as follows:

First level: 2^1 = 2 scales
Second level: 2^2 = 4 scales
Third level: 2^3 = 8 scales

As the size of the subbands decreases, the time resolution increases to represent the high-frequency information more precisely. Fig. 4 shows the structure of subband selection and combination for the dyadic way of coding the frame signal. For uniform-X DWT-LPC, in contrast, all subbands were extracted into an equal number of coefficients regardless of their content. Either type of DWT-LPC feature was obtained by concatenating the coefficients starting from the deepest level of the decomposition.
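The pre-processing chain described above (mean-variance normalization, pre-emphasis with H(z) = 1 − 0.9375 z^{−1}, 512-point framing with a 256-point step, and the Hamming window of (6)) can be sketched in NumPy as follows. This is a minimal sketch: the ADC zero-adjustment and word segmentation steps are omitted, and the function name is our own.

```python
import numpy as np

def preprocess(speech, frame_len=512, step=256, alpha=0.9375):
    """Pre-processing chain of Fig. 3: mean-variance normalization,
    first-order pre-emphasis H(z) = 1 - alpha*z^-1, frame blocking
    into frame_len-point frames with a step-point stride, and
    Hamming windowing per eq. (6). Returns a (n_frames, frame_len)
    array of windowed frames."""
    x = (speech - speech.mean()) / speech.std()    # mean-variance norm.
    x = np.append(x[0], x[1:] - alpha * x[:-1])    # pre-emphasis FIR
    n = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # eq. (6)
    n_frames = 1 + (len(x) - frame_len) // step
    return np.stack([x[i * step : i * step + frame_len] * w
                     for i in range(n_frames)])
```

For one second of 16 kHz speech (16000 samples) this yields 61 overlapping frames, each ready for the DWT stage.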

IV. ACCENT MODELING USING K-NEAREST NEIGHBORS

The K-nearest neighbors (KNN) prediction of an unknown pattern, i.e. a query instance, is based on a simple majority vote among the classes of the nearest neighbors in the training space. The underlying principle is to use minimum distances from the unlabeled sample to the training samples to determine the nearest K neighbors. Euclidean distance is one of the most popular measures. The distance from a sample in the testing dataset, which contains the unknown patterns, to a sample in the training dataset with known class labels is expressed in (7).

d_ij² = Σ_{m=1}^{M} [x_i(m) − x_j(m)]².   (7)

where x_i(.) and x_j(.) are exemplars of the training and testing datasets in the m-th feature dimension, m = 1, 2, …, M.

The algorithm on how KNN works [29] is summarized as flowchart in Fig. 5.
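The decision rule sketched in Fig. 5 can be written compactly in NumPy. This is a minimal sketch of KNN with the Euclidean distance of (7); the tie-breaking rule (fall back to the single nearest neighbor) is our assumption, since the paper does not state one.

```python
import numpy as np

def knn_classify(query, X_train, y_train, K=2):
    """Classify one query pattern by majority vote among its K nearest
    training exemplars under the squared Euclidean distance of eq. (7).
    K=2 is the value the paper found best; ties are broken in favour of
    the single nearest neighbour (an assumption of this sketch)."""
    # d_ij^2 = sum_m [x_i(m) - x_j(m)]^2 against every training sample
    d2 = np.sum((X_train - query) ** 2, axis=1)
    nearest = np.argsort(d2)[:K]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    winners = labels[counts == counts.max()]
    if len(winners) == 1:
        return winners[0]
    return y_train[nearest[0]]  # tie: fall back to the nearest neighbour
```

A query lying near a cluster of training points of one class is assigned that class's label.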

V. RESULTS AND DISCUSSION

In this section, the proposed DWT-LPC feature-based accent recognizers are tested against two baseline features, the conventional LPC and the state-of-the-art MFCC. The accent correlates were measured as classification rate (CR) percentages, examining how well the KNN model recognizes the tested speech samples.



Figure 5. Summary of K-nearest Neighbors algorithm.

Figure 6. Three-level decomposition of the word bottom (the 21st frame of the first sample) from a Malay female speaker labeled Spk018: approximation A3 and detail subbands D1, D2, D3 (amplitude versus sample number).

Figure 7. Original and reconstructed signals for the word bottom, the 21st frame, 1st sample from Spk018 (amplitude versus sample number).

Figure 8. Trends of mean classification rates for uniform-X DWT-LPC and dyadic-X DWT-LPC versus the conventional LPC and MFCC features across different feature sizes for the three-class accent classification task.

The results of the three-level decomposition into approximation and detail signals are also analyzed here. Fig. 6 shows the three-level decomposition of a frame signal into approximation and detail signals. The approximation is the high-scale part corresponding to the low-frequency components of the signal, whilst the details are the low-scale parts corresponding to the high-frequency components. Although decomposing successive approximations into smaller speech constituents refines the analysis, a few scales, about 2-3 levels, is sufficient [20] and keeps the size of the feature vector of a sample in proportion to the size of the dataset.

Fig. 7 shows an original frame signal and the signal reconstructed from the wavelet decomposition structure. The comparison shows that three-level decomposition with third-order Daubechies wavelet filters reproduces the original intelligibly.

After preparing the speech data into four datasets, i.e. conventional LPC, uniform-X DWT-LPC, dyadic-X DWT-LPC and MFCC, the datasets were reshuffled and randomized prior to partitioning into two independent sets for training and testing. We used 80% for training and the remaining 20% for testing to evaluate the accent recognizer. A complete testing phase comprised 10 trials, over which the overall three-class CRs were averaged. Predictive coefficients (p-order) and mel-cepstral coefficients between 8 and 32 were tested. The cepstral coefficients for MFCC were computed with 20 mel filters. For the KNN classifier, Euclidean distance was used, and K values from 1 to 12 were tried to find the optimum parameter; K = 2 gave the best results for all datasets.

We analyzed the resulting CRs statistically by taking their mean and standard deviation. Fig. 8 illustrates the mean CRs for the uniform-X and dyadic-X DWT-LPC methods against the baseline LPC and MFCC methods across different feature sizes. For the uniform-X DWT-LPC, the size must be multiplied by a factor of 4, as the coefficients are extracted equally from the four subbands (e.g. for p = 8, the total number of predictive coefficients is #LPC = 8 × 4 = 32).



Table 1 gives a detailed breakdown of #LPC in each subband for the dyadic-X type. Fewer frequency samples at the higher levels of the decomposition tree are the reason for the reduced #LPC required, as in our hypothesis.

TABLE 1. Contribution of LPC coefficients in multi-resolution subbands according to the dyadic-X rule of the DWT-LPC features.

p-order | cA3 | cD3 | cD2 | cD1 | Total
   8    |  1  |  1  |  2  |  4  |   8
  10    |  1  |  1  |  3  |  5  |  10
  12    |  2  |  2  |  3  |  6  |  13
  14    |  2  |  2  |  4  |  7  |  15
  16    |  2  |  2  |  4  |  8  |  16
  18    |  2  |  2  |  5  |  9  |  18
  20    |  3  |  3  |  5  | 10  |  21
  22    |  3  |  3  |  6  | 11  |  23
  24    |  3  |  3  |  6  | 12  |  24
  26    |  3  |  3  |  7  | 13  |  26
  28    |  4  |  4  |  7  | 14  |  29
  30    |  4  |  4  |  8  | 15  |  31
  32    |  4  |  4  |  8  | 16  |  32
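The dyadic-X allocation of Table 1 can be reproduced in a few lines. The paper does not state its rounding rule explicitly; rounding half up on p divided by the subband scale (8, 8, 4, 2) matches every row of the table, so that is the assumption made in this sketch.

```python
import math

def dyadic_allocation(p):
    """Number of predictive coefficients per subband [cA3, cD3, cD2, cD1]
    under the dyadic-X rule: p divided by each subband's scale. The
    round-half-up rule is our reconstruction (it reproduces Table 1);
    the paper does not state which rounding it used."""
    rh = lambda v: int(math.floor(v + 0.5))  # round half up
    return [rh(p / 8), rh(p / 8), rh(p / 4), rh(p / 2)]
```

For example, dyadic_allocation(12) gives [2, 2, 3, 6], a total of 13 coefficients, matching the p = 12 row of Table 1.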

Two interesting conclusions can be drawn from Fig. 8. Firstly, for every p-order, our proposed DWT-LPC-based recognizers outperform the conventional LPC recognizer. Secondly, LPC generally improves with increasing p-order, with multiple peaks averaging 85.45% between p = 24 and 28. By interpretation, the higher the p-order, the better the constructed model predicts the signal, whereas a lower p-order model tracks the trend or envelope of the signal, ideally the vocal tract shape, giving a smooth spectrum [21]. Similarly, MFCC improves gradually beyond 12 dimensions and achieves its best CR of 93.81% at 28 dimensions. The uniform-X DWT-LPC behaves quite differently from the original LPC: its CR decreases from 32 to 72 dimensions and fluctuates around 90.07% for the remaining dimensions, with a best hit rate of 93.25% at p = 32. The dyadic-X DWT-LPC improves substantially from 8 to 21 dimensions and then stays roughly stagnant, averaging 93.38% from p = 21 onwards. From its global shape, it is no overstatement that the dyadic-X DWT-LPC is a smoothed and magnified version of the LPC curve, with an improvement of 9.28% in the best hit rate. The uniform-X DWT-LPC has the disadvantage that its feature size is four times the p-order.

Table 2 summarizes the best performance of all feature types for this three-class accent task. All except LPC achieve almost the same CR, but dyadic-X DWT-LPC is the most efficient because of its smallest feature size.

TABLE 2. Summary of the selected best performance for different features.

Feature type      | Mean classification rate (%) | Feature size
LPC               | 85.63                        | 28
Uniform-X DWT-LPC | 93.25                        | 32
Dyadic-X DWT-LPC  | 93.25                        | 21
MFCC              | 93.81                        | 28

To complement the mean CRs over 10 iterations, the standard deviations of the CRs for all four feature types are tabulated in Table 3. In general, the spreads of the CRs of LPC, uniform-X DWT-LPC, dyadic-X DWT-LPC and MFCC about their means are small, namely 1.82%, 1.67%, 1.69% and 1.57% respectively.

TABLE 3. Standard deviation of accent classification rates for different features and feature sizes (feature size is the p-order or the number of mel-cepstral coefficients).

Feature size | LPC  | Uniform-X DWT-LPC | Dyadic-X DWT-LPC | MFCC
     8       | 1.60 | 1.64              | 2.00             | 1.68
    10       | 2.16 | 1.24              | 1.87             | 1.59
    12       | 2.47 | 1.59              | 1.89             | 1.46
    14       | 1.80 | 1.55              | 1.44             | 1.66
    16       | 1.44 | 1.80              | 1.71             | 1.36
    18       | 2.19 | 1.76              | 1.53             | 1.52
    20       | 1.81 | 1.72              | 1.60             | 1.83
    22       | 1.89 | 1.86              | 1.76             | 1.80
    24       | 1.92 | 1.95              | 1.46             | 1.75
    26       | 1.46 | 1.95              | 1.56             | 1.40
    28       | 1.21 | 1.75              | 1.54             | 1.53
    30       | 1.94 | 1.44              | 1.90             | 1.19
    32       | 1.84 | 1.51              | 1.71             | 1.60

Next, the detailed properties of the mean CRs for all feature types are shown in Table 4, which augments the parameter settings of Table 2 with the true-positive (hit) rates in each class. The total samples in the test datasets for the Malay, Chinese and Indian classes were 450, 450 and 360 respectively. In general, the Malay accent has the highest hit rate for all feature types, but dyadic-X DWT-LPC has almost equal hit rates across all classes.

TABLE 4. Measurement properties of classification rate (CR) for different features on the 20% test dataset (10 trials).

Accent class   | LPC   | Uniform-X DWT-LPC | Dyadic-X DWT-LPC | MFCC
Malay accent   | 89.33 | 96.00             | 92.89            | 96.44
Chinese accent | 82.67 | 91.33             | 93.78            | 94.00
Indian accent  | 84.72 | 92.22             | 93.06            | 90.28

VI. CONCLUSION

This paper has presented a speaker accent classification study utilizing a hybrid DWT-LPC feature space with uniform and dyadic extraction of linear predictive coefficients. We compared the proposed features against the baseline LPC and MFCC references. The DWT-LPC-based features with KNN recognizers performed very well for female MalE speakers in the three-class classification of the Malay, Chinese and Indian accents, which represent the three majority ethnic groups in Malaysia. The best overall CR averaged over 10 trials was 93.25%, with 32- and 21-dimensional spaces for the uniform and dyadic manners respectively. It is worth noting that the uniform type of DWT-LPC quadruples the feature size and behaves unexpectedly with increasing p-order. The dyadic-type DWT-LPC, on the other hand, increased the CR by 9.28% with respect to the conventional LPC while retaining the feature size. Compared to the state-of-the-art MFCC features, it was on a par in accuracy and more efficient in size; furthermore, it has been discussed in [22] that MFCC is easily affected by background noise. In a nutshell, we conclude that the dyadic-X



DWT-LPC was highly correlated with the Malay, Chinese and Indian accents and is very useful for research on ethnically diverse MalE accents.

ACKNOWLEDGMENT

The authors would like to acknowledge the support and encouragement given by the Vice Chancellor of Universiti Malaysia Perlis (UniMAP), Brig. Jeneral Dato' Prof. Dr. Kamaruddin Hussin, and the Ph.D candidature sponsorship from the Ministry of Higher Education, Malaysia.

REFERENCES

[1] K. McGee, "Attitudes towards accents of English at the British Council, Penang: What do the students want?," Malaysian Journal of ELT Research (MELTA), vol. 5, pp. 162-205, 2009.
[2] S. Nair-Venugopal, "English, identity and the Malaysian workplace," in World Englishes, vol. 19, Oxford, UK: Blackwell Publishers Ltd., 2000, pp. 205-213.
[3] S. Deshpande, S. Chikkerur, and V. Govindaraju, "Accent classification in speech," in Automatic Identification Advanced Technologies, Fourth IEEE Workshop on, 2005, pp. 139-143.
[4] L. M. Arslan and J. H. L. Hansen, "Language accent classification in American English," Speech Communication, vol. 18, pp. 353-367, 1996.
[5] D. Vergyri, L. Lamel, and J. L. Gauvain, "Automatic speech recognition of multiple accented English data," in INTERSPEECH 2010, 2010.
[6] C. Teixeira, I. Trancoso, and A. Serralheiro, "Accent identification," in Spoken Language (ICSLP 96), Fourth International Conference on, 1996, pp. 1784-1787, vol. 3.
[7] W. K. Liu and P. Fung, "Fast accent identification and accented speech recognition," in Acoustics, Speech, and Signal Processing (ICASSP '99), 1999 IEEE International Conference on, 1999, pp. 221-224, vol. 1.
[8] J. J. Humphries, P. C. Woodland, and D. Pearce, "Using accent-specific pronunciation modelling for robust speech recognition," in Spoken Language (ICSLP 96), Fourth International Conference on, 1996, pp. 2324-2327, vol. 4.
[9] N. Seman and K. Jusoff, "Acoustic pronunciation variations modeling for standard Malay speech recognition," Computer and Information Science, vol. 1, p. P112, 2009.
[10] B. H. A. Ahmed and T. P. Tan, "Non-native accent pronunciation modeling in automatic speech recognition," in Asian Language Processing (IALP), 2011 International Conference on, 2011, pp. 224-227.
[11] S. Dupont, C. Ris, O. Deroo, and S. Poitoux, "Feature extraction and acoustic modeling: an approach for improved generalization across languages and accents," in Automatic Speech Recognition and Understanding, 2005 IEEE Workshop on, 2005, pp. 29-34.
[12] J. W. Picone, "Signal modeling techniques in speech recognition," Proceedings of the IEEE, vol. 81, pp. 1215-1247, 1993.
[13] J. W. Pitton, K. Wang, and B.-H. Juang, "Time-frequency analysis and auditory modeling for automatic recognition of speech," Proceedings of the IEEE, vol. 84, pp. 1199-1215, 1996.
[14] P. J. Ghesquiere and D. Van Compernolle, "Flemish accent identification based on formant and duration features," in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, 2002, pp. I-749-I-752.
[15] M. A. Yusnita, M. P. Paulraj, S. Yaacob, S. A. Bakar, and A. Saidatul, "Malaysian English accents identification using LPC and formant analysis," in Control System, Computing and Engineering (ICCSCE), 2011 IEEE International Conference on, 2011, pp. 472-476.
[16] J. Hou, Y. Liu, T. F. Zheng, J. Olsen, and J. Tian, "Multi-layered features with SVM for Chinese accent identification," in Audio Language and Image Processing (ICALIP), 2010 International Conference on, 2010, pp. 25-30.
[17] O. Farooq and S. Datta, "Mel filter-like admissible wavelet packet structure for speech recognition," IEEE Signal Processing Letters, vol. 8, pp. 196-198, 2001.
[18] M. I. Abdalla and H. S. Ali, "Wavelet-based mel-frequency cepstral coefficients for speaker identification using hidden Markov model," Journal of Telecommunications, vol. 1, March 2010.
[19] H. R. Tohidypour, S. A. Seyyedsalehi, and H. Behbood, "Comparison between wavelet packet transform, Bark wavelet & MFCC for robust speech recognition tasks," in Industrial Mechatronics and Automation (ICIMA), 2010 2nd International Conference on, 2010, pp. 329-332.
[20] N. S. Nehe and R. S. Holambe, "New feature extraction methods using DWT and LPC for isolated word recognition," in TENCON 2008 - 2008 IEEE Region 10 Conference, 2008, pp. 1-6.
[21] M. Rosell, "An introduction to front-end processing and acoustic features for automatic speech recognition," Lecture Notes, School of Computer Science and Communication, KTH, Sweden, 2006.
[22] Z. Tufekci and J. N. Gowdy, "Feature extraction using discrete wavelet transform for speech recognition," in Southeastcon 2000, Proceedings of the IEEE, 2000, pp. 116-123.
[23] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, pp. 561-580, 1975.
[24] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Englewood Cliffs, NJ: Prentice Hall, 1993.
[25] M. Misiti, Y. Misiti, G. Oppenheim, and J. P.

[15] M. A. Yusnita, M. P. Paulraj, S. Yaacob, S. A. Bakar, and A. Saidatul, "Malaysian English accents identification using LPC and formant analysis," in Control System, Computing and Engineering (ICCSCE), 2011 IEEE International Conference on, 2011, pp. 472-476. [16] H. Jue, L. Yi, T. F. Zheng, J. Olsen, and T. Jilei, "Multi-layered features with SVM for Chinese accent identification," in Audio Language and Image Processing (ICALIP), 2010 International Conference on, 2010, pp. 25-30. [17] O. Farooq and S. Datta, "Mel filter-like admissible wavelet packet structure for speech recognition," Signal Processing Letters, IEEE, vol. 8, pp. 196-198, 2001. [18] M. I. Abdalla and H. S. Ali, "Wavelet-Based Mel-Frequency Cepstral Coefficients for Speaker Identification using Hidden Markov Model," Journal of Telecommunications, vol. 1, March 2010. [19] H. R. Tohidypour, S. A. Seyyedsalehi, and H. Behbood, "Comparison between wavelet packet transform, Bark Wavelet & MFCC for robust speech recognition tasks," in Industrial Mechatronics and Automation (ICIMA), 2010 2nd International Conference on, pp. 329-332. [20] N. S. Nehe and R. S. Holambe, "New feature extraction methods using DWT and LPC for isolated word recognition," in TENCON 2008 - 2008 IEEE Region 10 Conference, 2008, pp. 1-6. [21] M. Rosell, "An introduction to front-end processing and acoustic features for automatic speech recognition," in Lecture Notes of School of Computer Science and Communication, KTH, Sweden, 2006. [22] Z. Tufekci and J. N. Gowdy, "Feature extraction using discrete wavelet transform for speech recognition," in Southeastcon 2000. Proceedings of the IEEE, 2000, pp. 116-123. [23] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, pp. 561-580, 1975. [24] L. Rabiner and B. H. Juang, Fundamentals of speech recognition vol. 103: Prentice hall Englewood Cliffs, New Jersey, 1993. [25] M. Misiti, Y. Misiti, G. Oppenheim, and J. P. 
Michel, "Wavelet toolbox: for use with MATLAB," 1996. [26] M. Weeks, Digital Signal Processing Using MATLAB and Wavelets. Massachusetts: Infinite Science Press, 2006. [27] S. Mallat, A wavelet tour of signal processing: Academic Press, 1998. [28] A. Cohen, I. Daubechies, and J. C. Feauveau, "Biorthogonal bases of compactly supported wavelets," Communications on pure and applied mathematics, vol. 45, pp. 485-560, 2006. [29] K. Teknomo, "K-Nearest Neighbors Tutorial," 2011.

2012 International Symposium on Computer Applications and Industrial Electronics (ISCAIE 2012), December 3-4, 2012, Kota Kinabalu, Malaysia
