Voice Activity Detection
Using Higher Order Statistics
J.M. Górriz, J. Ramírez, J.C. Segura, and S. Hornillo
Dept. Teoría de la Señal, Telemática y Comunicaciones,
Facultad de Ciencias, Universidad de Granada,
Fuentenueva s/n, 18071 Granada, Spain
gorriz@ugr.es
Abstract. A robust and effective voice activity detection (VAD) algorithm is
proposed for improving speech recognition performance in noisy environments.
The approach is based on filtering the input channel to avoid high energy
noisy components and then determining the speech/non-speech bispectra by
means of third order auto-cumulants. This algorithm differs from many others
in the way the decision rule is formulated (detection tests) and in the domain
used. Clear improvements in speech/non-speech discrimination accuracy
demonstrate the effectiveness of the proposed VAD. It is shown that the
application of statistical detection tests leads to a better separation of the
speech and noise distributions, thus allowing a more effective discrimination
and a tradeoff between complexity and performance. The algorithm also
incorporates a preceding noise reduction block that improves the accuracy in
detecting speech and non-speech.
1 Introduction
Nowadays speech/non-speech detection is a complex problem in speech
processing and affects numerous applications including robust speech
recognition [1], discontinuous transmission [2, 3], real-time speech
transmission on the Internet [4] or combined noise reduction and echo
cancellation schemes in the context of telephony [5]. The speech/non-speech
classification task is not as trivial as it appears, and most VAD algorithms
fail when the level of background noise increases. During the last decade,
numerous researchers have developed different strategies for detecting speech
in a noisy signal [6] and have evaluated the influence of the VAD
effectiveness on the performance of speech processing systems [7]. Most of
them have focused on the development of robust algorithms with special
attention to the derivation and study of noise robust features and decision
rules [8, 9, 10]. The different approaches include those based on energy
thresholds [8], pitch detection [11], spectrum analysis [10], zero-crossing
rate [3], periodicity measure [12], higher order statistics in the LPC
residual domain [13] or combinations of different features [3, 2]. This paper
explores a new alternative towards improving speech detection robustness in
adverse environments and the performance of speech recognition systems. The
proposed approach incorporates a noise
J. Cabestany, A. Prieto, and D.F. Sandoval (Eds.): IWANN 2005, LNCS 3512, pp. 837–844, 2005.
© Springer-Verlag Berlin Heidelberg 2005
reduction block that precedes the VAD, and uses the bispectra of third order
cumulants to formulate a robust decision rule. The rest of the paper is
organized as follows. Section 2 reviews the theoretical background on
bispectrum analysis and presents the proposed signal model, analyzing the
motivations for the proposed algorithm by comparing the speech/non-speech
distributions of our bispectrum-based decision function, with and without the
optional noise reduction. Section 3 describes the experimental framework
considered for the evaluation of the proposed statistical decision algorithm.
Finally, the last section summarizes the conclusions of this work.
2 Model Assumptions
Let $\{x(t)\}$ denote the discrete time measurements at the sensor. Consider
the set of stochastic variables $y_k$, $k = 0, \pm 1, \ldots, \pm M$, obtained
from the shift of the input signal $\{x(t)\}$:

$$y_k(t) = x(t + k \cdot \tau) \tag{1}$$

where $k \cdot \tau$ is the differential delay (or advance) between the
samples. This provides a new set of $2 \cdot M + 1$ variables by selecting
$n = 1, \ldots, N$ samples of the input signal, which can be represented using
the associated Toeplitz matrix. Using this model, speech/non-speech detection
can be described by two essential hypotheses (re-ordering indexes):

$$H_0 = \begin{cases} y_0 = n_0 \\ y_{\pm 1} = n_{\pm 1} \\ \ldots \\ y_{\pm M} = n_{\pm M} \end{cases}; \qquad H_1 = \begin{cases} y_0 = s_0 + n_0 \\ y_{\pm 1} = s_{\pm 1} + n_{\pm 1} \\ \ldots \\ y_{\pm M} = s_{\pm M} + n_{\pm M} \end{cases} \tag{2}$$

where the $s_k$'s/$n_k$'s are the speech/non-speech (any kind of additive
background noise, e.g. Gaussian) signals, related to each other by some
differential parameter. All the processes involved are assumed to be jointly
stationary and zero-mean. Consider the third order cumulant function
$C_{y_k y_l}$ defined as:

$$C_{y_k y_l} \equiv E[y_0 y_k y_l]; \qquad C_{y_k y_l}(\omega_1, \omega_2) = \sum_{k=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} C_{y_k y_l} \cdot \exp(-j(\omega_1 k + \omega_2 l)) \tag{3}$$

and the two-dimensional discrete Fourier transform (DFT) of $C_{y_k y_l}$, the
bispectrum function. The sequence of cumulants of the voice speech is modelled
as a sum of coherent sine waves:

$$C_{y_k y_l} = \sum_{n,m=1}^{K} a_{nm} \cos[k n \omega_0^{1} + l m \omega_0^{2}] \tag{4}$$

where $a_{nm}$ is the amplitude, $K \times K$ is the number of sinusoids and
$\omega_0^{1}$, $\omega_0^{2}$ are the fundamental frequencies in each
dimension. It follows from equation 4 that $a_{nm}$ is related to the energy
of the signal $E_s = E\{s^2\}$. The VAD proposed in the latter
[Figure 1: two panels, (a) and (b). Each panel shows the averaged signal versus time (s), the 3rd order cumulant (V^3) versus lag τ (s), and the bispectrum magnitude (V^3/Hz^2) and bispectrum phase (deg) versus frequency f (Hz).]
Fig. 1. Different Features allowing voice activity detection. (a) Features of Voice Speech
Signal. (b) Features of non Speech Signal
reference only works with the coefficients in the sequence of cumulants and is
more restrictive in the model of voice speech. Thus the bispectrum associated
with this sequence is the DFT of equation 4, which consists of a set of Dirac
deltas at each excitation frequency $n\omega_0^{1}$, $m\omega_0^{2}$. Our
algorithm will detect any high frequency peak in this domain matching voice
speech frames; that is, under the above assumptions and hypotheses, it follows
that on $H_0$,

$$C_{y_k y_l}(\omega_1, \omega_2) \equiv C_{n_k n_l}(\omega_1, \omega_2) \simeq 0 \tag{5}$$

and on $H_1$:

$$C_{y_k y_l}(\omega_1, \omega_2) \equiv C_{s_k s_l}(\omega_1, \omega_2) \neq 0 \tag{6}$$

Since $s_k(t) = s(t + k \cdot \tau)$ where $k = 0, \pm 1, \ldots, \pm M$, we get

$$C_{s_k s_l}(\omega_1, \omega_2) = \mathcal{F}\{E[s(t + k \cdot \tau)\, s(t + l \cdot \tau)\, s(t)]\} \tag{7}$$

The estimation of the bispectrum (equation 3) is discussed in depth in [14]
and many others, where conditions for consistency are given. The estimate is
said to be (asymptotically) consistent if the squared deviation goes to zero
as the number of samples tends to infinity.
2.1 Detection Tests for Voice Activity
The decision of our algorithm implementing the VAD is based on statistical
tests from references [15] (generalized likelihood ratio tests) and [16]
(central $\chi^2$-distributed test statistic under $H_0$). We will call the
tests GLRT and $\chi^2$ tests. The tests are based on some asymptotic
distributions, and computer simulations in [17] show that the $\chi^2$ tests
require larger data sets to achieve a consistent theoretical asymptotic
distribution. We therefore decline to use them, unlike the GLRT tests.
If we reorder the components of the set of $L$ bispectrum estimates
$\hat{C}(n_l, m_l)$, where $l = 1, \ldots, L$, on the fine grid around the
bifrequency pair into an $L$-vector $\beta_{ml}$, where $m = 1, \ldots, P$
indexes the coarse grid [15], and define $P$-vectors
$\phi_i(\beta_{1i}, \ldots, \beta_{Pi})$, $i = 1, \ldots, L$, the generalized
likelihood ratio test for the above hypothesis testing problem:

$$H_0 : \mu = 0 \quad \text{against} \quad H_1 : \eta \equiv \mu^T \sigma^{-1} \mu > 0 \tag{8}$$

where $\mu = \frac{1}{L} \sum_{i=1}^{L} \phi_i$ and
$\sigma = \frac{1}{L} \sum_{i=1}^{L} (\phi_i - \mu)(\phi_i - \mu)^T$, leads to
voice activity detection if:

$$\eta > \eta_0 \tag{9}$$

where $\eta_0$ is a constant threshold determined by the chosen significance
level, i.e. the probability of false alarm.
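As an illustration of Eqs. (8) and (9), the following minimal sketch computes the statistic $\eta = \mu^T \sigma^{-1} \mu$ from a set of $L$ feature vectors and compares it with a threshold. It is not the authors' implementation: the function names `glrt_statistic` and `detect_speech`, the use of real-valued vectors $\phi_i$ (for instance stacked real and imaginary parts of the bispectrum estimates), and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def glrt_statistic(phi):
    """Return eta = mu^T sigma^{-1} mu for an (L, P) array of real P-vectors."""
    phi = np.asarray(phi, dtype=float)
    n_vectors = phi.shape[0]
    mu = phi.mean(axis=0)                      # sample mean of the phi_i
    centered = phi - mu
    sigma = centered.T @ centered / n_vectors  # covariance with 1/L normalisation
    # Solve sigma x = mu rather than forming the explicit inverse.
    return float(mu @ np.linalg.solve(sigma, mu))

def detect_speech(phi, eta0):
    """Decide H1 (speech) when the statistic exceeds the threshold eta0."""
    return glrt_statistic(phi) > eta0

# Synthetic illustration (hypothetical data, not from the paper):
rng = np.random.default_rng(0)
noise_like = rng.normal(0.0, 1.0, size=(200, 3))       # zero-mean features, H0
speech_like = noise_like + np.array([1.0, 0.5, 0.0])   # shifted mean, H1
print(glrt_statistic(noise_like), glrt_statistic(speech_like))
```

Under $H_0$ the sample mean is close to zero and so is $\eta$, while a non-zero mean under $H_1$ pushes $\eta$ above the threshold. In practice $\eta_0$ would be chosen from the desired false alarm probability rather than fixed by hand.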
2.2 Noise Reduction Block
Almost any VAD can be improved simply by placing a noise reduction block in
the data channel before it. The noise reduction block for high energy noisy
peaks consists of four stages: (1) spectrum smoothing, (2) noise estimation,
(3) Wiener filter (WF) design and (4) frequency domain filtering. It was first
developed in [18].
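The four stages can be sketched generically as follows. This is only an illustration of the processing chain, not the block developed in [18]: the recursive smoothing constant, the assumption that the first few frames contain only noise, the gain floor and the name `wiener_denoise` are all choices made for this example.

```python
import numpy as np

def wiener_denoise(frames, n_noise_frames=5, smooth=0.9, gain_floor=0.1):
    """Filter an (n_frames, frame_len) real array frame by frame."""
    frames = np.asarray(frames, dtype=float)
    spectra = np.fft.rfft(frames, axis=1)
    power = np.abs(spectra) ** 2

    # 1) Spectrum smoothing: first-order recursive average over time.
    smoothed = np.empty_like(power)
    smoothed[0] = power[0]
    for t in range(1, len(power)):
        smoothed[t] = smooth * smoothed[t - 1] + (1.0 - smooth) * power[t]

    # 2) Noise estimation: mean power of the leading frames (assumed speech-free).
    noise = power[:n_noise_frames].mean(axis=0)

    # 3) Wiener filter design: gain = SNR / (1 + SNR), bounded below by a floor.
    snr = np.maximum(smoothed / (noise + 1e-12) - 1.0, 0.0)
    gain = np.maximum(snr / (1.0 + snr), gain_floor)

    # 4) Frequency domain filtering, then back to the time domain.
    return np.fft.irfft(gain * spectra, n=frames.shape[1], axis=1)
```

Because the gain never exceeds one, the filtered frames cannot carry more energy than the input. Bins dominated by noise are attenuated towards the floor, while bins with a high estimated SNR pass almost unchanged.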
2.3 Some Remarks About the Algorithm
We propose an alternative decision based on an average of the components of
the bispectrum (its absolute value). In this way we define $\eta$ as:

$$\eta = \frac{1}{L \cdot N} \sum_{i=1}^{L} \sum_{j=1}^{N} |\hat{C}(i, j)| \tag{10}$$

where $L$, $N$ define the selected grid (high frequencies with noteworthy
variability). We also include long term information (LTI) in the decision of
the on-line VAD [19], which essentially improves the efficiency of the
proposed method, as shown in the following pseudocode:
Initialize variables
Determine η0 of noise in the first frame
for i = 1 to end:
    1. Consider a new frame (i) and calculate η(i)
    2. if H1 then
           VAD(i) = 1
           apply LTI to VAD(i-τ)
       else
           VAD(i) = 0
           Slow update of noise parameters: η0(i+1) = α η0(i) + β η(i), with α + β = 1, α → 1
           apply LTI to VAD(i-τ)
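The loop above can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not taken from the paper: the delay τ is one sample, the cumulant is estimated with a biased sample average, η is averaged over the whole bifrequency grid instead of a selected high frequency region, the decision uses a hypothetical multiplicative `margin` over η0, and the LTI step is omitted.

```python
import numpy as np

def third_order_cumulant(frame, max_lag):
    """Estimate C(k, l) = E[y(t) y(t+k) y(t+l)] for |k|, |l| <= max_lag."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                       # the model assumes zero-mean signals
    n = len(x)
    lags = range(-max_lag, max_lag + 1)
    cum = np.zeros((len(lags), len(lags)))
    for a, k in enumerate(lags):
        for b, l in enumerate(lags):
            lo = max(0, -k, -l)            # keep t, t+k and t+l inside the frame
            hi = min(n, n - k, n - l)
            t = np.arange(lo, hi)
            cum[a, b] = np.sum(x[t] * x[t + k] * x[t + l]) / n
    return cum

def eta(frame, max_lag=8):
    """Average bispectrum magnitude, in the spirit of Eq. (10)."""
    bispectrum = np.fft.fft2(third_order_cumulant(frame, max_lag))
    return float(np.mean(np.abs(bispectrum)))

def run_vad(frames, alpha=0.95, margin=3.0, max_lag=8):
    """Return one 0/1 decision per frame; the first frame is assumed to be noise."""
    eta0 = eta(frames[0], max_lag)
    decisions = []
    for frame in frames:
        value = eta(frame, max_lag)
        if value > margin * eta0:          # H1: speech
            decisions.append(1)
        else:                              # H0: slow update of the noise level
            decisions.append(0)
            eta0 = alpha * eta0 + (1.0 - alpha) * value
    return decisions
```

Only the magnitude of the bispectrum is used, so the ordering of the lags passed to the 2-D FFT does not affect η. A value of `alpha` close to one keeps the noise reference from drifting quickly, matching the α → 1 condition in the pseudocode.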
Fig. 2 shows the operation of the proposed VAD on an utterance of the Spanish
SpeechDat-Car (SDC) database [20]. The phonetic transcription is: [“siete”,
“θinko”, “dos”, “uno”, “otSo”, “seis”]. Fig. 2(b) shows the value of $\eta$
versus time. Observe how, taking as $\eta_0$ the initial value of the
magnitude $\eta$ over the
[Figure 2: two panels, (a) and (b). Recoverable labels: "VAD decision", "etha" and "Threshold" curves, with a horizontal axis in frames.]
Fig. 2. Operation of the VAD on an utterance of Spanish SDC database. (a) Evaluation
of η and VAD Decision. (b) Evaluation of the test hypothesis on an example utterance
of the Spanish SpeechDat-Car (SDC) database [20]
first frame (noise), we can achieve a good VAD decision. It is clearly shown
how the detection tests yield improved speech/non-speech discrimination of
fricative sounds by giving complementary information. The VAD performs an
early detection of word beginnings and a delayed detection of word endings
which, in part, makes a hang-over unnecessary. In Fig. 1 we display the
differences between noise and voice in general and in figure we settle these
differences in the evaluation of $\eta$ on speech and non-speech frames.
3 Experimental Framework
The ROC curves are frequently used to completely describe the VAD error rate.
The AURORA subset of the original Spanish SpeechDat-Car (SDC) database [20]
was used in this analysis. This database contains 4914 recordings using
close-talking and distant microphones from more than 160 speakers. The files
are categorized into three noisy conditions: quiet, low noisy and highly noisy
conditions, which represent different driving conditions with average SNR
values between 25 dB and 5 dB. The non-speech hit rate (HR0) and the false
alarm rate (FAR0 = 100 - HR1) were determined in each noise condition, with
the actual speech frames and actual speech pauses determined by hand-labelling
the database on the close-talking microphone. Fig. 3 shows the ROC curves of
the proposed VAD (BiSpectra based-VAD) and other frequently referred
algorithms [8, 9, 10, 6] for recordings from the distant microphone in quiet,
low and high noisy conditions. The working points of the G.729, AMR and AFE
VADs are also included. The results show improvements in detection accuracy
over standard VADs and performance similar to a representative set of VAD
algorithms [8, 9, 10, 6]. The benefits are especially important over G.729,
which is used along with a speech codec for discontinuous transmission, and
over Li's algorithm, which is based on an optimum linear filter for edge
detection. On average ($(HR0 + HR1)/2$), the proposed VAD is similar to
Marzinzik's VAD, which tracks the power spectral envelopes, and Sohn's VAD,
which formulates the decision rule by means of a statistical likelihood ratio
test. These results clearly demonstrate that there is no optimal VAD for all
the applications. Each VAD is developed and optimized for specific purposes.
Hence, the evaluation has to be conducted according to the