
Voice Activity Detection Using Higher Order Statistics

J.M. Gorriz, J. Ramírez, J.C. Segura, and S. Hornillo

Dept. Teoría de la Señal, Telemática y Comunicaciones,

Facultad de Ciencias, Universidad de Granada,

Fuentenueva s/n, 18071 Granada, Spain

gorriz@ugr.es

Abstract. A robust and effective voice activity detection (VAD) algorithm is proposed for improving speech recognition performance in noisy environments. The approach is based on filtering the input channel to avoid high-energy noisy components and then determining the speech/non-speech bispectra by means of third-order auto-cumulants. This algorithm differs from many others in the way the decision rule is formulated (detection tests) and in the domain used. Clear improvements in speech/non-speech discrimination accuracy demonstrate the effectiveness of the proposed VAD. It is shown that the application of statistical detection tests leads to a better separation of the speech and noise distributions, thus allowing a more effective discrimination and a tradeoff between complexity and performance. The algorithm also incorporates a preceding noise reduction block that improves the accuracy in detecting speech and non-speech.

1 Introduction

Nowadays speech/non-speech detection is a complex problem in speech processing and affects numerous applications including robust speech recognition [1], discontinuous transmission [2, 3], real-time speech transmission on the Internet [4] or combined noise reduction and echo cancellation schemes in the context of telephony [5]. The speech/non-speech classification task is not as trivial as it appears, and most VAD algorithms fail when the level of background noise increases. During the last decade, numerous researchers have developed different strategies for detecting speech in a noisy signal [6] and have evaluated the influence of VAD effectiveness on the performance of speech processing systems [7]. Most of them have focused on the development of robust algorithms, with special attention to the derivation and study of noise-robust features and decision rules [8, 9, 10]. The different approaches include those based on energy thresholds [8], pitch detection [11], spectrum analysis [10], zero-crossing rate [3], periodicity measure [12], higher order statistics in the LPC residual domain [13] or combinations of different features [3, 2]. This paper explores a new alternative towards improving speech detection robustness in adverse environments and the performance of speech recognition systems. The proposed VAD incorporates a noise reduction block that precedes the VAD, and uses bispectra of third-order cumulants to formulate a robust decision rule.

The rest of the paper is organized as follows. Section 2 reviews the theoretical background on bispectrum analysis and presents the proposed signal model, analyzing the motivations for the proposed algorithm by comparing the speech/non-speech distributions of our bispectrum-based decision function, also when noise reduction is optionally applied. Section 3 describes the experimental framework considered for the evaluation of the proposed statistical decision algorithm. Finally, a closing section summarizes the conclusions of this work.

J. Cabestany, A. Prieto, and D.F. Sandoval (Eds.): IWANN 2005, LNCS 3512, pp. 837–844, 2005. © Springer-Verlag Berlin Heidelberg 2005

2 Model Assumptions

Let $\{x(t)\}$ denote the discrete-time measurements at the sensor. Consider the set of stochastic variables $y_k$, $k = 0, \pm 1, \ldots, \pm M$, obtained from shifts of the input signal $\{x(t)\}$:

$$y_k(t) = x(t + k\cdot\tau) \tag{1}$$

where $k\cdot\tau$ is the differential delay (or advance) between the samples. This provides a new set of $2\cdot M + 1$ variables by selecting $n = 1 \ldots N$ samples of the input signal, which can be represented using the associated Toeplitz matrix. Using this model, speech/non-speech detection can be described by two essential hypotheses (re-ordering indexes):

$$H_0 = \begin{pmatrix} y_0 = n_0 \\ y_{\pm 1} = n_{\pm 1} \\ \vdots \\ y_{\pm M} = n_{\pm M} \end{pmatrix}; \qquad H_1 = \begin{pmatrix} y_0 = s_0 + n_0 \\ y_{\pm 1} = s_{\pm 1} + n_{\pm 1} \\ \vdots \\ y_{\pm M} = s_{\pm M} + n_{\pm M} \end{pmatrix} \tag{2}$$

where the $s_k$'s/$n_k$'s are the speech/non-speech signals (any kind of additive background noise, i.e. Gaussian), related to each other through some differential parameter. All the processes involved are assumed to be jointly stationary and zero-mean. Consider the third-order cumulant function $C_{y_k y_l}$ and its two-dimensional discrete Fourier transform (DFT), the bispectrum function $C_{y_k y_l}(\omega_1, \omega_2)$, defined as:

$$C_{y_k y_l} \equiv E[y_0 y_k y_l]; \qquad C_{y_k y_l}(\omega_1, \omega_2) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} C_{y_k y_l} \cdot \exp(-j(\omega_1 k + \omega_2 l)) \tag{3}$$

The sequence of cumulants of the voice speech is modelled as a sum of coherent sine waves:

$$C_{y_k y_l} = \sum_{n,m=1}^{K} a_{nm} \cos[k n \omega_0 + l m \omega_0] \tag{4}$$

where $a_{nm}$ is the amplitude, $K \times K$ is the number of sinusoids and $\omega_0$ is the fundamental frequency in each dimension. It follows from equation 4 that $a_{nm}$ is related to the energy of the signal $E_s = E\{s^2\}$. The VAD proposed in the latter reference only works with the coefficients in the sequence of cumulants and is more restrictive in the model of voice speech. Thus the bispectrum associated with this sequence is the DFT of equation 4, which consists of a set of Dirac deltas at each excitation frequency $n\omega_0$, $m\omega_0$. Our algorithm will detect any high-frequency peak in this domain that matches voice speech frames; that is, under the above assumptions and hypotheses, it follows that on $H_0$,

$$C_{y_k y_l}(\omega_1, \omega_2) \equiv C_{n_k n_l}(\omega_1, \omega_2) \simeq 0 \tag{5}$$

and on $H_1$:

$$C_{y_k y_l}(\omega_1, \omega_2) \equiv C_{s_k s_l}(\omega_1, \omega_2) \neq 0 \tag{6}$$

Since $s_k(t) = s(t + k\cdot\tau)$ where $k = 0, \pm 1, \ldots, \pm M$, we get

$$C_{s_k s_l}(\omega_1, \omega_2) = \mathcal{F}\{E[s(t + k\cdot\tau)\, s(t + l\cdot\tau)\, s(t)]\} \tag{7}$$

The estimation of the bispectrum (equation 3) is discussed in depth in [14] and many others, where conditions for consistency are given. The estimate is said to be (asymptotically) consistent if the squared deviation goes to zero as the number of samples tends to infinity.

[Figure 1: panels (a) and (b) each show the Averaged Signal versus Time (s), the 3rd Order Cumulant (V^3) versus Lag τ (s), and the Bispectrum Magnitude (V^3/Hz^2) and Bispectrum Phase (deg) versus Frequency f (Hz).]

Fig. 1. Different features allowing voice activity detection. (a) Features of voice speech signal. (b) Features of non-speech signal

2.1 Detection Tests for Voice Activity

The decision of our algorithm implementing the VAD is based on statistical tests from references [15] (generalized likelihood ratio tests) and [16] (central $\chi^2$-distributed test statistic under $H_0$). We will call these the GLRT and $\chi^2$ tests. The tests are based on asymptotic distributions, and computer simulations in [17] show that the $\chi^2$ tests require larger data sets to achieve a consistent theoretical asymptotic distribution. We therefore decline to use them, unlike the GLRT tests.

If we reorder the components of the set of $L$ bispectrum estimates $\hat{C}_{y_k y_l}(n_l, m_l)$, where $l = 1, \ldots, L$, on the fine grid around the bifrequency pair into an $L$-vector $\beta_{ml}$, where $m = 1, \ldots, P$ indexes the coarse grid [15], and define $P$-vectors $\phi_i(\beta_{1i}, \ldots, \beta_{Pi})$, $i = 1, \ldots, L$, the generalized likelihood ratio test for the hypothesis testing problem discussed above,

$$H_0: \mu = 0 \quad \text{against} \quad H_1: \eta \equiv \mu^T \sigma^{-1} \mu > 0 \tag{8}$$

where $\mu = \frac{1}{L}\sum_{i=1}^{L} \phi_i$ and $\sigma = \frac{1}{L}\sum_{i=1}^{L} (\phi_i - \mu)(\phi_i - \mu)^T$, leads to voice activity detection if:

$$\eta > \eta_0 \tag{9}$$

where $\eta_0$ is a constant determined by the probability of false alarm.
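The statistic of equations 8 and 9 can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes the $P$-vectors $\phi_i$ are already available as a real-valued array of shape (L, P) (how they are built from the bispectrum estimates follows [15] and is not reproduced here), and the function names and synthetic data are hypothetical.

```python
import numpy as np

def glrt_statistic(phi):
    """Return eta = mu^T sigma^{-1} mu for an (L, P) array whose rows are phi_i."""
    phi = np.asarray(phi, dtype=float)
    n_vectors = phi.shape[0]
    mu = phi.mean(axis=0)                      # mu = (1/L) sum_i phi_i
    centered = phi - mu
    sigma = centered.T @ centered / n_vectors  # (1/L) sum_i (phi_i - mu)(phi_i - mu)^T
    # Solve sigma x = mu rather than forming the inverse explicitly.
    return float(mu @ np.linalg.solve(sigma, mu))

def detect_speech(phi, eta0):
    """Equation 9: declare speech when eta exceeds the threshold eta0."""
    return glrt_statistic(phi) > eta0

# Synthetic data (assumption): zero-mean vectors mimic H0, a shifted mean mimics H1.
rng = np.random.default_rng(0)
noise_phi = rng.normal(0.0, 1.0, size=(200, 4))
speech_phi = noise_phi + np.array([2.0, 0.0, 0.0, 0.0])
eta_noise = glrt_statistic(noise_phi)
eta_speech = glrt_statistic(speech_phi)
```

With these synthetic inputs the statistic stays close to zero for the zero-mean vectors and grows with the squared mean shift, which is the separation that the threshold $\eta_0$ exploits.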

2.2 Noise Reduction Block

Almost any VAD can be improved simply by placing a noise reduction block in the data channel before it. The noise reduction block, which targets high-energy noisy peaks, consists of four stages: (1) spectrum smoothing, (2) noise estimation, (3) Wiener filter (WF) design and (4) frequency-domain filtering. It was first developed in [18].
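The four stages can be illustrated with a short sketch. The paper does not give the details of each stage (they are in [18]), so everything specific below is an assumption made for illustration: a moving average over frequency for the smoothing, a noise estimate taken from the first few frames, the textbook Wiener gain SNR/(1 + SNR), non-overlapping frames, and the names and default values of all parameters.

```python
import numpy as np

def wiener_denoise_frames(frames, n_noise_frames=5, smooth_len=5, gain_floor=0.1):
    """Filter an (n_frames, frame_len) real array frame by frame."""
    frames = np.asarray(frames, dtype=float)
    spectra = np.fft.rfft(frames, axis=1)
    power = np.abs(spectra) ** 2

    # 1) Spectrum smoothing: moving average along the frequency axis.
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.apply_along_axis(
        lambda p: np.convolve(p, kernel, mode="same"), 1, power)

    # 2) Noise estimation: average of the first frames, assumed to be non-speech.
    noise_power = smoothed[:n_noise_frames].mean(axis=0)

    # 3) Wiener filter design: gain = SNR / (1 + SNR), floored to limit distortion.
    snr = np.maximum(smoothed / (noise_power + 1e-12) - 1.0, 0.0)
    gain = np.maximum(snr / (1.0 + snr), gain_floor)

    # 4) Frequency-domain filtering, then back to the time domain.
    return np.fft.irfft(spectra * gain, n=frames.shape[1], axis=1)

# Synthetic check (assumption): white noise, with a strong tone in the last frames.
rng = np.random.default_rng(1)
n = np.arange(256)
frames = rng.normal(0.0, 1.0, size=(20, 256))
frames[10:] += 5.0 * np.sin(2 * np.pi * 16 * n / 256)
cleaned = wiener_denoise_frames(frames)
noise_kept = np.sum(cleaned[:10] ** 2) / np.sum(frames[:10] ** 2)
tone_kept = np.sum(cleaned[10:] ** 2) / np.sum(frames[10:] ** 2)
```

In this toy setting the noise-only frames lose most of their energy while the frames containing the tone keep most of theirs, which is the behaviour a VAD placed after the block would benefit from.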

2.3 Some Remarks About the Algorithm

We propose an alternative decision based on an average of the components of the bispectrum (the absolute value of it). In this way we define $\eta$ as:

$$\eta = \frac{1}{L \cdot N} \sum_{i=1}^{L} \sum_{j=1}^{N} \left| \hat{C}_{y_k y_l}(i, j) \right| \tag{10}$$

where $L$, $N$ define the selected grid (high frequencies with noteworthy variability). We also include long term information (LTI) in the decision of the on-line VAD [19], which essentially improves the efficiency of the proposed method, as shown in the following pseudocode:

– Initialize variables

– Determine η0 of noise in the first frame

– for i=1 to end:

1. Consider a new frame (i) and calculate η(i)

2. if H1 then

• VAD(i)=1

• apply LTI to VAD(i-τ)

else

• Slow update of noise parameters: η0(i+1) = α η0(i) + β η(i), α + β = 1, α → 1

• apply LTI to VAD(i-τ)
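The loop above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: the cumulant is estimated by a plain sample average over each frame, the bispectrum is a 2-D FFT of that estimate, η averages over the whole bifrequency grid rather than a selected high-frequency region, the LTI step from [19] is omitted, and the `margin` factor applied to η0 as well as all parameter values are hypothetical.

```python
import numpy as np

def third_order_cumulant(frame, max_lag):
    """Sample estimate of C(k, l) = E[y(t) y(t+k) y(t+l)] for |k|, |l| <= max_lag."""
    y = np.asarray(frame, dtype=float)
    y = y - y.mean()                       # the model assumes zero-mean processes
    n = len(y)
    lags = range(-max_lag, max_lag + 1)
    core = y[max_lag:n - max_lag]          # samples for which every shift is valid
    c = np.empty((len(lags), len(lags)))
    for a, k in enumerate(lags):
        yk = y[max_lag + k:n - max_lag + k]
        for b, l in enumerate(lags):
            yl = y[max_lag + l:n - max_lag + l]
            c[a, b] = np.mean(core * yk * yl)
    return c

def eta_from_frame(frame, max_lag=8):
    """Equation 10 over the full grid: mean absolute value of the bispectrum."""
    c = third_order_cumulant(frame, max_lag)
    bispectrum = np.fft.fft2(np.fft.ifftshift(c))   # zero lag moved to index (0, 0)
    return float(np.mean(np.abs(bispectrum)))

def run_vad(frames, alpha=0.95, margin=3.0, max_lag=8):
    """Frame-wise decisions with a slow update of the noise level eta0."""
    eta0 = eta_from_frame(frames[0], max_lag)       # first frame assumed to be noise
    decisions = []
    for frame in frames:
        eta = eta_from_frame(frame, max_lag)
        if eta > margin * eta0:                     # H1 (margin is an assumed factor)
            decisions.append(1)
        else:
            decisions.append(0)
            eta0 = alpha * eta0 + (1.0 - alpha) * eta   # slow update, alpha + beta = 1
    return decisions

# Synthetic check (assumption): Gaussian noise frames, then frames containing
# three phase-coupled cosines, which have a non-zero third-order cumulant.
rng = np.random.default_rng(0)
t = np.arange(512)
w1, w2 = 2 * np.pi * 20 / 512, 2 * np.pi * 33 / 512
tone = 3.0 * (np.cos(w1 * t) + np.cos(w2 * t) + np.cos((w1 + w2) * t))
frames = [rng.normal(size=512) for _ in range(6)]
frames += [tone + rng.normal(size=512) for _ in range(4)]
decisions = run_vad(frames)
```

The Gaussian frames yield a small η because their third-order cumulants are close to zero, while the phase-coupled component produces strong bispectral peaks; this mirrors the contrast between equations 5 and 6.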

Fig. 2 shows the operation of the proposed VAD on an utterance of the Spanish SpeechDat-Car (SDC) database [20]. The phonetic transcription is: [“siete”, “θinko”, “dos”, “uno”, “otSo”, “seis”]. Fig. 2(b) shows the value of $\eta$ versus time. Observe how, taking as $\eta_0$ the initial value of the magnitude $\eta$ over the first frame (noise), we can achieve a good VAD decision. It is clearly shown how the detection tests yield improved speech/non-speech discrimination of fricative sounds by giving complementary information. The VAD detects word beginnings early and word endings late which, in part, makes a hang-over unnecessary. In Fig. 1 we display the differences between noise and voice in general, and in figure we settle these differences in the evaluation of $\eta$ on speech and non-speech frames.

[Figure 2: plots labelled "VAD decision", "etha" and "Threshold", with a "frame" axis.]

Fig. 2. Operation of the VAD on an utterance of the Spanish SDC database. (a) Evaluation of η and VAD decision. (b) Evaluation of the test hypothesis on an example utterance of the Spanish SpeechDat-Car (SDC) database [20]

3 Experimental Framework

ROC curves are frequently used to completely describe the VAD error rate. The AURORA subset of the original Spanish SpeechDat-Car (SDC) database [20] was used in this analysis. This database contains 4914 recordings using close-talking and distant microphones from more than 160 speakers. The files are categorized into three noisy conditions: quiet, low noisy and highly noisy conditions, which represent different driving conditions with average SNR values between 25dB and 5dB. The non-speech hit rate (HR0) and the false alarm rate (FAR0 = 100 − HR1) were determined in each noise condition, with the actual speech frames and actual speech pauses determined by hand-labelling the database on the close-talking microphone. Fig. 3 shows the ROC curves of the proposed VAD (BiSpectra based-VAD) and other frequently referred algorithms [8, 9, 10, 6] for recordings from the distant microphone in quiet, low and high noisy conditions. The working points of the G.729, AMR and AFE VADs are also included. The results show improvements in detection accuracy over standard VADs and performance similar to a representative set of VAD algorithms [8, 9, 10, 6].
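The rates used for the ROC analysis can be computed from frame-level labels as in the sketch below. It follows the definitions given in the text (HR0, HR1 and FAR0 = 100 − HR1, all in percent); the function names, the toy label sequences and the threshold sweep used to trace ROC points are illustrative assumptions.

```python
import numpy as np

def vad_rates(reference, decision):
    """Return (HR0, HR1, FAR0) in percent from frame-level 0/1 labels."""
    ref = np.asarray(reference, dtype=bool)
    dec = np.asarray(decision, dtype=bool)
    hr1 = 100.0 * np.sum(dec & ref) / np.sum(ref)       # speech frames detected as speech
    hr0 = 100.0 * np.sum(~dec & ~ref) / np.sum(~ref)    # pauses detected as pauses
    far0 = 100.0 - hr1                                  # definition used in the text
    return hr0, hr1, far0

def roc_points(eta, reference, thresholds):
    """One (FAR0, HR0) working point per threshold applied to the values of eta."""
    eta = np.asarray(eta, dtype=float)
    points = []
    for threshold in thresholds:
        hr0, _, far0 = vad_rates(reference, eta > threshold)
        points.append((far0, hr0))
    return points

# Toy labels (assumption): 1 = speech, 0 = non-speech, one entry per frame.
reference = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
decision = [0, 0, 1, 0, 1, 1, 1, 0, 1, 0]
hr0, hr1, far0 = vad_rates(reference, decision)
```

In this toy case four of the five pauses and four of the five speech frames are classified correctly, so HR0 and HR1 are both 80 and FAR0 is 20.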

The benefits are especially important over G.729, which is used along with a speech codec for discontinuous transmission, and over Li's algorithm, which is based on an optimum linear filter for edge detection. On average ($(HR0 + HR1)/2$), the proposed VAD is similar to Marzinzik's VAD, which tracks the power spectral envelopes, and Sohn's VAD, which formulates the decision rule by means of a statistical likelihood ratio test. These results clearly demonstrate that there is no optimal VAD for all applications. Each VAD is developed and optimized for specific purposes. Hence, the evaluation has to be conducted according to the
