2011 4th International Conference on Mechatronics (ICOM), 17-19 May 2011, Kuala Lumpur, Malaysia

978-1-61284-437-4/11/$26.00 ©2011 IEEE

Speech Compression using Compressive Sensing on a Multicore System

T.S. Gunawan, O.O. Khalifa, A.A. Shafie
Faculty of Engineering
International Islamic University Malaysia
53100 Kuala Lumpur, Malaysia
[email protected]

E. Ambikairajah
School of Electrical Engineering and Telecommunications
The University of New South Wales
Sydney, NSW 2052, Australia
[email protected]

Abstract—Compressive sensing (CS) is a new approach to the simultaneous sensing and compression of sparse and compressible signals, such as speech. Compressive sensing is a new paradigm for acquiring signals that is fundamentally different from uniform-rate digitization followed by compression, as commonly used for transmission or storage. In this paper, a novel speech coding algorithm based on the CS principle is developed. The sparsity of speech signals is exploited using a gammatone filterbank and the Discrete Cosine Transform (DCT), and the compressive sensing principle is then applied to the sparse subband signals. All parameters are optimized using an informal listening test and the Perceptual Evaluation of Speech Quality (PESQ). To further reduce the bit requirement, vector quantization using a codebook trained on the training signals is added to the system. The performance of the overall algorithm is evaluated in terms of processing time and speech quality. Finally, to speed up the process, the proposed algorithm is implemented on a multicore system with six cores using the Single Program Multiple Data (SPMD) parallel paradigm.

Keywords— Compressive sensing, speech coding, PESQ, listening test, multicore system, SPMD

I. INTRODUCTION

Compressive sensing (CS) has gained great interest since 2006. It is a new paradigm for acquiring signals that is fundamentally different from uniform-rate digitization followed by compression, as commonly used for transmission or storage [1-3]. The conventional approach of sampling a signal at the Nyquist rate, i.e. twice the signal bandwidth, and then removing redundant information before efficient transmission or storage requires a large amount of signal processing at the transmitter (encoder), although the receiver (decoder) is relatively simple. CS provides sampling and compression simultaneously, with the side effect of encrypting the source information. With CS, the signal reconstruction quality can be traded against the processing power available at the decoder.

The fundamental theory of compressive sensing states that signal recovery from random projections of a signal vector is possible provided the signal is known to be sparse in some vector space [3]. Sparsity is a measure of signal redundancy, and CS exploits this redundancy at the signal acquisition stage itself rather than in a subsequent compression stage [1]. Furthermore, many practical signals, such as speech and audio, satisfy the sparsity property in some linear transform domain.

Compressive sensing has been used in many applications, including speech processing, for example in noise reduction and speech denoising [4, 5] and in speech coding [6]. However, as CS is still a new paradigm, little research has been carried out on its use for speech and audio coding with rigorous evaluation. Therefore, the objective of this paper is to explore a new direction in speech coding based on compressive sensing and to evaluate its performance subjectively and objectively using a listening test and the Perceptual Evaluation of Speech Quality (PESQ). Furthermore, a multicore system with six cores is used to speed up the encoding and decoding process.

II. COMPRESSIVE SENSING THEORY

The compressive sensing theory presented in this section is adapted from [7]. For clarity, let x ∈ R^N be the speech signal, and let Ψ = {ψ_1, ψ_2, ..., ψ_N} be the basis vectors spanning R^N. The speech signal is said to be sparse if

x = \sum_{i=1}^{T} s_{n_i} \psi_{n_i}, \quad n_i \in \{1, 2, \ldots, N\} (1)

where s_{n_i} are scalar coefficients and T << N, i.e. s_n (or simply s) is a sparse vector with only T non-zero elements. Based on CS theory, x can be sampled through projections onto random bases and the speech signal can be reconstructed at the decoder with full knowledge of the random bases. In other words, the sampling (sensing) measurements can be defined as

y(i) = \sum_{m=1}^{N} x(m)\, \phi_i(m), \quad i = 1, \ldots, M, \quad M < N (2)

or y = Φx, where Φ is an M × N measurement matrix whose rows are the orthonormal random basis vectors φ_m. If the incoherence condition between Φ and Ψ is satisfied, then there is a high probability that x can be reconstructed perfectly from y provided M ≥ T log N, as mentioned in [1]. Convex optimization can then be utilized as follows:

\hat{s} = \arg\min_{s} \|s\|_1 \quad \text{subject to} \quad y = \Phi\Psi s, \quad \text{and} \quad \hat{x} = \Psi\hat{s} (3)

First author is also a Visiting Fellow at UNSW.

where ||·||_1 is the ℓ1 norm. The algorithm above is also known as "basis pursuit" (BP), since a subset of the column vectors of ΦΨ is being determined. One efficient algorithm for solving the CS problem is "orthogonal matching pursuit" (OMP) [8], which can be formulated as follows:

\hat{s} = \arg\min_{s} \|y - \Phi\Psi s\|_2 \quad \text{subject to} \quad \|s\|_0 \le T (4)

Many solutions to the sparse approximation problem have been proposed, such as matching pursuit (MP), least absolute shrinkage and selection operator (LASSO), basis pursuit (BP), and gradient pursuit (GP), whose performance shows an interdependence between the number of measurements, the measurement noise, the signal sparsity and the reconstruction algorithm itself [9, 10].
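As an illustration of Eqs. (2) and (4), the following minimal NumPy sketch (not the authors' implementation) measures a synthetic T-sparse signal with a random M × N matrix and recovers it with a hand-rolled OMP; the identity basis Ψ = I is assumed purely to keep the example short.

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, T = 256, 64, 8                     # signal length, measurements, sparsity

    # T-sparse coefficient vector s (Eq. 1); Psi = I is assumed, so x = s.
    s = np.zeros(N)
    support = rng.choice(N, size=T, replace=False)
    s[support] = rng.standard_normal(T)
    x = s

    # Random M x N measurement matrix Phi and measurements y = Phi x (Eq. 2).
    Phi = rng.standard_normal((M, N)) / np.sqrt(M)
    y = Phi @ x

    def omp(A, y, T):
        # orthogonal matching pursuit (Eq. 4): greedily select T columns of A
        residual, idx = y.copy(), []
        for _ in range(T):
            idx.append(int(np.argmax(np.abs(A.T @ residual))))
            coef, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
            residual = y - A[:, idx] @ coef
        s_hat = np.zeros(A.shape[1])
        s_hat[idx] = coef
        return s_hat

    x_hat = omp(Phi, y, T)                   # A = Phi Psi = Phi, since Psi = I here
    print("relative reconstruction error:",
          np.linalg.norm(x - x_hat) / np.linalg.norm(x))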

Because of the time-varying nature of speech signals, sensing and compression are applied to short segments of the signal. It is known that the perceptually significant features, namely the spectral resonances and the harmonicity due to periodic excitation, are the most important and basic parameters of speech and audio. Therefore, to exploit the sparsity of the speech signal, several alternative representations of a speech frame can be considered, such as

x = C^{-1} \alpha_1 (5)

x = F^{-1} \alpha_2 (6)

Eq. (5) describes a DCT, where C is the real-valued transform matrix and α_1 are the DCT coefficients. Similarly, in Eq. (6), α_2 corresponds to the DFT matrix F, which is complex valued. Hence, various transforms, such as the Fast Fourier Transform (FFT), the Discrete Cosine Transform (DCT), and the Discrete Wavelet Transform (DWT) with various wavelet bases, can be used to sparsify the speech signal. The transform that yields the highest sparsity can be selected using the Gini index, which has been shown to be a reliable sparsity measure [11]. In this paper, to simplify the analysis, only the DCT is considered.
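To make this concrete, the short NumPy sketch below (illustrative only, not the paper's code) computes the Gini index of [11] for a synthetic voiced frame in the time domain and in the DCT domain of Eq. (5); a value closer to 1 indicates a sparser representation.

    import numpy as np
    from scipy.fftpack import dct

    def gini_index(c):
        # Gini sparsity measure of [11]: 0 for a flat vector, close to 1 for a sparse one
        a = np.sort(np.abs(c))
        N = len(a)
        k = np.arange(1, N + 1)
        return 1.0 - 2.0 * np.sum(a * (N - k + 0.5)) / (N * np.sum(a))

    fs, dur = 8000, 0.032                      # 8 kHz, 32 ms frame as in Sec. IV
    t = np.arange(int(fs * dur)) / fs
    # Synthetic voiced frame: a few harmonics of a 120 Hz pitch (illustrative only).
    frame = sum(np.sin(2 * np.pi * 120 * h * t) / h for h in range(1, 6))

    alpha1 = dct(frame, norm='ortho')          # Eq. (5): x = C^{-1} alpha_1
    print("Gini (time domain):", round(gini_index(frame), 3))
    print("Gini (DCT domain): ", round(gini_index(alpha1), 3))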

III. SPEECH COMPRESSION USING COMPRESSIVE SENSING

A. Proposed Algorithm

The proposed speech compression algorithm using compressive sensing is illustrated in Fig. 1. Of the various speech processing front ends, the gammatone filter bank is utilized due to its resemblance to the shape of the human auditory filters [12]. Of the various transforms available to sparsify the speech signal, the DCT is chosen due to its simplicity and its good decorrelation property [13].
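The sketch below outlines one common way to realize such a gammatone analysis filter bank (17 subbands at 8 kHz, 4th-order filters with ERB-spaced centre frequencies). The frequency range and implementation details are assumptions for illustration; they are not taken from the paper.

    import numpy as np

    fs, n_bands = 8000, 17

    def erb(f):
        # equivalent rectangular bandwidth (Hz) of an auditory filter centred at f
        return 24.7 * (4.37 * f / 1000.0 + 1.0)

    def erb_space(f_low, f_high, n):
        # n centre frequencies equally spaced on the ERB-rate scale
        e = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
        pts = np.linspace(e(f_low), e(f_high), n)
        return (10.0 ** (pts / 21.4) - 1.0) * 1000.0 / 4.37

    def gammatone_ir(fc, dur=0.032):
        # 4th-order gammatone impulse response, truncated to `dur` seconds
        t = np.arange(int(dur * fs)) / fs
        return t ** 3 * np.exp(-2 * np.pi * 1.019 * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

    def analyse(x):
        # split x into n_bands subband signals by FIR filtering with each impulse response
        return [np.convolve(x, gammatone_ir(fc))[:len(x)]
                for fc in erb_space(100.0, 3500.0, n_bands)]

    subbands = analyse(np.random.randn(int(0.032 * fs)))   # e.g. one 32 ms frame
    print(len(subbands), "subbands of", len(subbands[0]), "samples each")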

In order to further reduce the bit requirement, vector quantization using a codebook of the training signals is incorporated into the proposed system. If h = {h_1, h_2, ..., h_L} is the codebook of size L, then the CS optimization problem in the time domain can be stated as:

(\hat{h}, \hat{r}) = \arg\min_{h, r} \|y - \Phi h r\|_2 \quad \text{subject to} \quad \|r\|_0 \le T, \quad \text{and} \quad \hat{x} = \hat{h}\hat{r} (7)

where the search is conducted over the cross product of the matrix codebook and the basis vector set Ψ. The signal produced is then transmitted to the receiver or decoder.
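A hedged sketch of the joint search implied by Eq. (7) is given below. Here each codebook entry is modelled as an N × N shaping matrix and a T-sparse vector r is fitted by OMP for every entry, keeping the pair with the smallest residual; the matrix shapes, the use of OMP, and the synthetic check are assumptions for illustration, since the paper performs this search per subband with 17-dimensional codebook entries and decodes with GPSR.

    import numpy as np

    def omp(A, y, T):
        # greedy OMP fit; returns the sparse vector and its residual norm
        residual, idx = y.copy(), []
        for _ in range(T):
            idx.append(int(np.argmax(np.abs(A.T @ residual))))
            coef, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
            residual = y - A[:, idx] @ coef
        r = np.zeros(A.shape[1])
        r[idx] = coef
        return r, np.linalg.norm(residual)

    def codebook_cs_decode(y, Phi, Psi, codebook, T):
        # exhaustive search over codebook entries H, fitting a T-sparse r for each (cf. Eq. 7)
        best = None
        for H in codebook:
            r, res = omp(Phi @ H @ Psi, y, T)
            if best is None or res < best[0]:
                best = (res, H, r)
        _, H, r = best
        return H @ Psi @ r                   # x_hat = H Psi r_hat

    # tiny synthetic check: the decoder should reconstruct x with near-zero error
    rng = np.random.default_rng(1)
    N, M, T, L = 64, 24, 4, 8
    Psi = np.eye(N)
    Phi = rng.standard_normal((M, N)) / np.sqrt(M)
    codebook = [np.linalg.qr(rng.standard_normal((N, N)))[0] for _ in range(L)]  # random shaping matrices
    r_true = np.zeros(N)
    r_true[rng.choice(N, size=T, replace=False)] = rng.standard_normal(T)
    x = codebook[3] @ Psi @ r_true
    x_hat = codebook_cs_decode(Phi @ x, Phi, Psi, codebook, T)
    print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))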

Fig. 1. Proposed Speech Compression Algorithm using Compressive Sensing

At the decoder, the gradient projection for sparse reconstruction (GPSR) algorithm [9] was utilized to solve the convex optimization problem of compressive sensing, due to its high accuracy and low complexity. As mentioned in [1], the reconstruction quality can be traded against the processing power available at the decoder: the higher the processing power, or the longer the time available to solve the convex optimization problem, the higher the reconstructed signal quality. Furthermore, the IDCT and delay compensation are applied to the reconstructed subband signals. The amount of filter delay accumulated in each subband is different, and without compensating for this delay the recombination of the subband signals would lead to an incoherent, lower-quality output signal.

B. Parallel Implementation on a Multicore System

The hardware used for the parallel implementation was a multicore system with an AMD Phenom II X6 2.80 GHz processor (six cores), 12 GB RAM, and a 1.5 TB hard disk. On the software side, it ran 64-bit Windows 7 and Matlab 2010b with the Parallel Computing Toolbox and the Distributed Computing Server.

The parallel programming models available in the Matlab Parallel Computing Toolbox are the parallel for-loop (parfor), distributed arrays, single program multiple data (SPMD), and interactive parallel computation with pmode [14]. In this paper, the single program multiple data (SPMD) parallel programming model is utilized, as a parallel program written in this way for a multicore system can easily be extended to a cluster of many computers to further speed up the computation.
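The paper's implementation uses MATLAB's SPMD construct; purely as an illustrative analogue, the Python sketch below distributes independent speech files over worker processes in the same single-program-multiple-data spirit. The function body and the file names are placeholders, not the authors' code.

    from multiprocessing import Pool

    def encode_decode(path):
        # placeholder for the per-file CS encoder/decoder described above:
        # load the file, run subband CS encoding, GPSR decoding, and scoring
        return path, 0.0                                    # e.g. (file name, PESQ score)

    if __name__ == "__main__":
        files = [f"sp{n:02d}.wav" for n in range(21, 31)]   # hypothetical names for x21..x30
        with Pool(processes=6) as pool:                     # six cores, as in the paper
            results = pool.map(encode_decode, files)        # one file per task
        print(results)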

Fig. 2. Parallel Implementation of the Proposed Algorithm

Fig. 2 shows the parallel implementation of the proposed speech compression algorithm. Suppose that there are many speech signals, i.e. x = {x_1, x_2, x_3, ..., x_N}, each to be processed individually. By distributing the tasks to N processors, the expected speed up is a factor of N, assuming that the coordination and communication times among the processors are negligible and that the tasks are equally distributed.

IV. EXPERIMENTS

In this section, the speech signal datasets used are described, the process of constructing the codebook (vector quantization) is explained, and the performance of the proposed algorithm is presented.

A. Datasets

To evaluate the performance of the proposed algorithms, the NOIZEUS dataset was used [15]. Thirty sentences from the IEEE sentence database were recorded in a sound-proof booth. The sentences were produced by three male and three female speakers, originally sampled at 25 kHz and then downsampled to 8 kHz. Recordings 1 to 10 and 21 to 25 were produced by male speakers, while recordings 11 to 20 and 26 to 30 were produced by female speakers. The NOIZEUS dataset was chosen because it contains all the phonemes of American English.

B. Codebook

A codebook h based on the DCT representation of a speech frame, as described in Eq. (7), was constructed. NOIZEUS recordings 1 to 20 sampled at 8 kHz, i.e. x1 to x20, were used to train the codebook. Hence, the 30 speech files in the dataset were divided into separate training and testing sets: 20 speech files were used for training and the remaining 10 for testing. The speech signals were processed frame-by-frame with a 32 ms frame duration and analysed using a 17-subband gammatone filter bank. A codebook was constructed for each subband signal using the LBG algorithm [16], with a 17-dimensional vector quantization codebook of size L = 128.
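For reference, a minimal LBG-style training sketch is given below (illustration only, not the training script used in the paper): the codebook is grown by repeated splitting followed by nearest-neighbour and centroid refinement until the target size of 128 entries is reached; the training vectors here are random placeholders.

    import numpy as np

    def lbg(vectors, L=128, n_iter=10, eps=1e-3):
        # start from the global centroid and double the codebook until it has L entries
        codebook = vectors.mean(axis=0, keepdims=True)
        while len(codebook) < L:
            codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])  # split
            for _ in range(n_iter):                                             # k-means refinement
                d = np.linalg.norm(vectors[:, None, :] - codebook[None], axis=2)
                nearest = d.argmin(axis=1)
                for k in range(len(codebook)):
                    members = vectors[nearest == k]
                    if len(members):
                        codebook[k] = members.mean(axis=0)
        return codebook

    train = np.random.randn(5000, 17)        # placeholder 17-dimensional training vectors
    cb = lbg(train, L=128)
    print(cb.shape)                          # (128, 17)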

C. Subjective and Objective Evaluation

The performance of the proposed coder was evaluated using PESQ. As mentioned in the previous section, recordings 21 to 30 of the NOIZEUS dataset were used for testing. Table I shows the objective evaluation of the proposed system using PESQ. It is interesting to note that the PESQ scores for the male speakers (x21 to x25) were somewhat lower than those for the female speakers (x26 to x30). One possible cause is that male speech is less sparse than female speech, so the compressive sensing method produces lower-quality reconstructed signals. Nevertheless, the average PESQ score of the reconstructed signals is 3.593, which can be considered acceptable quality for everyday communication. The informal listening test also confirmed the objective PESQ results.

TABLE I
OBJECTIVE EVALUATION USING PESQ FOR VARIOUS SPEECH FILES

Speech File    PESQ Score
x21            3.412
x22            3.191
x23            3.361
x24            3.271
x25            3.646
x26            3.994
x27            3.551
x28            3.781
x29            3.857
x30            3.863
Average        3.593
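As a pointer for reproducing such scores, the snippet below uses the open-source Python pesq package (an ITU-T P.862 implementation) in narrow-band mode for the 8 kHz material; this package and the file names are assumptions for illustration and are not what the authors used.

    import soundfile as sf
    from pesq import pesq          # pip install pesq; narrow-band ('nb') P.862 mode

    # hypothetical file names: clean reference and CS-reconstructed version of x21
    ref, fs = sf.read("sp21_clean.wav")
    deg, _ = sf.read("sp21_cs_reconstructed.wav")
    print("PESQ (MOS):", pesq(fs, ref, deg, 'nb'))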

D. Parallel Performance

The performance of the parallel implementation of the proposed algorithm is evaluated in terms of processing time and the speed-up curve. A multicore system with six cores was used for the evaluation, and the number of cores was varied from one (serial program) to a maximum of six (parallel program). Table II shows the elapsed processing time as the number of cores is varied from one to six. It is evident from the table that the processing time generally decreases as the number of cores increases. The SPMD programming model was used to distribute the data across the cores, and the test data consisted of the 10 speech files x21 to x30 of the NOIZEUS dataset.

To further evaluate the effectiveness of the parallelization scheme, Fig. 3 shows the speed-up curve for various numbers of cores. Ideally, the speed up increases linearly with the number of cores. However, the ideal speed up can only be achieved if the communication time between master and slaves is negligible and the data are equally distributed. In our case, with 10 speech files to distribute, the data distribution is only equal when the number of cores is two (five files each) or five (two files each). If the distribution is not equal, the elapsed time is dominated by the core with the largest number of tasks. Furthermore, if the time needed to coordinate and communicate the data exceeds the processing time, the speed up is lower than the ideal.
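This load-balancing bound can be made explicit with a tiny worked example (illustrative only): with 10 files of roughly equal cost, the elapsed time is set by the busiest core, so the best achievable speed up on k cores is 10/ceil(10/k). This already explains why six cores bring essentially no gain over five in Table II.

    import math

    files = 10
    for k in range(1, 7):
        # the busiest core processes ceil(files / k) files, which bounds the speed up
        print(k, "cores -> ideal speed up", files / math.ceil(files / k))
    # prints 1.0, 2.0, 2.5, 3.33, 5.0, 5.0 for k = 1..6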

TABLE II
ELAPSED PROCESSING TIMES FOR VARIOUS NUMBERS OF CORES

Number of Cores    Processing Time (seconds)
1                  59.46
2                  35.84
3                  29.40
4                  22.28
5                  16.65
6                  16.71

Fig. 3. Speed up for various numbers of cores

V. CONCLUSIONS

Speech coding using compressive sensing has been presented. The proposed algorithm uses a gammatone filter bank and the Discrete Cosine Transform (DCT) to sparsify the speech signal, since compressive sensing reconstruction works best on sparse signals. To reduce the bitrate requirement for transmission or storage, vector quantization using a codebook was applied to the system. Objective evaluation using PESQ revealed an average PESQ score of around 3.60, which is acceptable for everyday communication. An informal listening test further confirmed the reconstructed speech quality. The proposed algorithm was implemented using the SPMD paradigm on a multicore system with a maximum of six cores. In general, the higher the number of cores, the lower the processing time; however, the maximum speed up achieved in our experiments was around 3.57 when all six cores were utilized. Further research into various transforms for sparsifying speech signals and into more efficient parallelization schemes will be conducted.

REFERENCES

[1] E. J. Candes and M. B. Wakin, "An Introduction to Compressive Sampling," IEEE Signal Processing Magazine, pp. 21-30, 2008.
[2] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, pp. 1289-1306, 2006.
[3] R. G. Baraniuk, "Compressive sensing," IEEE Signal Processing Magazine, vol. 24, pp. 118-121, 2007.
[4] J. F. Gemmeke and B. Cranen, "Noise reduction through compressed sensing," in Proceedings of Interspeech, Brisbane, Australia, pp. 1785-1788, 2008.
[5] M. G. Jafari and M. D. Plumbley, "Speech Denoising Based on a Greedy Adaptive Dictionary Algorithm," in 17th European Signal Processing Conference, Glasgow, Scotland, pp. 1423-1426, 2009.
[6] M. G. Jafari and M. D. Plumbley, "An adaptive orthogonal sparsifying transform for speech signals," in International Symposium on Communications, Control and Signal Processing, pp. 786-790, 2008.
[7] T. V. Sreenivas and W. B. Kleijn, "Compressive sensing for sparsely excited speech signals," in IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 4125-4128, 2009.
[8] J. A. Tropp and A. C. Gilbert, "Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit," IEEE Transactions on Information Theory, vol. 53, pp. 4655-4666, 2007.
[9] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient projection for sparse reconstruction," IEEE Journal of Selected Topics in Signal Processing, vol. 1, pp. 586-597, 2007.
[10] H. Rauhut, K. Schnass, and P. Vandergheynst, "Compressed sensing and redundant dictionaries," IEEE Transactions on Information Theory, vol. 54, pp. 2210-2219, 2008.
[11] N. Hurley and S. Rickard, "Comparing Measures of Sparsity," IEEE Transactions on Information Theory, vol. 55, pp. 4723-4741, 2009.
[12] G. Kubin and W. Kleijn, "On speech coding in a perceptual domain," in International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 205-208, 1999.
[13] D. Salomon, Data Compression: The Complete Reference, 3rd Edition, Springer, 2004.
[14] MathWorks, Parallel Computing Toolbox 5: User's Guide, MathWorks, 2010.
[15] J. Ma, Y. Hu, and P. C. Loizou, "Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions," Journal of the Acoustical Society of America, vol. 125, pp. 3387-3405, 2009.
[16] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.