Journal of Computer Science and Cybernetics, V.33, N.3 (2017), 229–246
DOI 10.15625/1813-9663/33/3/11017
GMM FOR EMOTION RECOGNITION OF VIETNAMESE
DAO THI LE THUY¹,², TRINH VAN LOAN², NGUYEN HONG QUANG²
¹ Faculty of Information Technology, Ha Noi Vocational College of High Technology
² Ha Noi University of Science and Technology; thuydt@hht.edu.vn
Abstract. This paper presents the results of GMM-based recognition of four basic emotions in
Vietnamese speech: neutral, sadness, anger and happiness. The characteristic parameters of these
emotions are extracted from speech signals and grouped into different parameter sets for the
experiments. The experiments cover speaker-dependent and speaker-independent recognition combined
with content-dependent and content-independent recognition. The results show that the recognition
scores are rather high when the full combination of parameters is used: MFCC and its first and
second derivatives, fundamental frequency, energy, formants and their corresponding bandwidths,
spectral characteristics and F0 variants. On average, the speaker-dependent and content-dependent
recognition score is 89.21%. The average score is 82.27% for speaker-dependent and
content-independent recognition, 70.35% for speaker-independent and content-dependent recognition,
and 66.99% for speaker-independent and content-independent recognition. Information on F0
significantly increases the recognition score.
Keywords. GMM, recognition, emotion, Vietnamese, corpus, F0.
1. INTRODUCTION
Recognition of emotional speech has been of interest to researchers because it is particularly
useful for applications that require natural interaction between man and machine. Many studies
on the recognition of emotional speech are available for languages such as English, German,
Chinese, French and Spanish [1]. The majority of these studies use speech features in four
categories [1]: continuous features (pitch, energy, formants), voice quality features (easy or
hard listening, stress level, breathing level), spectral features (LPC (Linear Prediction
Coding), MFCC (Mel Frequency Cepstral Coefficients), LFPC (Log-Frequency Power Coefficients))
and TEO (Teager Energy Operator) features proposed by Teager (TEO-FM-Var (TEO-decomposed FM
variation), TEO-Auto-Env (normalized TEO autocorrelation envelope area), TEO-CB-Auto-Env
(critical band-based TEO autocorrelation envelope area)). At present, emotional Vietnamese is
studied mainly from a linguistic point of view [2]; there are very few studies from a signal
processing point of view. The published studies often rely on multi-modal corpora that combine
facial expressions, gestures and voices, and they mainly target the synthesis of Vietnamese.
For example, the studies in [3, 4] tested the modeling of Vietnamese prosody with a multi-modal
corpus in order to synthesize Vietnamese with emotion. The authors in [5] used SVM (Support
Vector Machines) for classification with brain electrical signals as inputs, and the results
showed that real-time recognition is possible with five states of emotion at an average accuracy
of 70.5%. In addition, a few studies of emotional Vietnamese have been carried out abroad, and
not primarily by Vietnamese researchers [6, 7]. The corpus of [6] contains two male voices and
two female voices with six sentences for six emotions: happiness, neutrality, sadness, surprise,
anger and fear. The GMM (Gaussian Mixture Model) was used with characteristic parameters such as
MFCC, short-term energy, pitch and formants. The highest recognition score is 96.5% for
neutrality and the lowest is 76.5% for sadness. The corpus of [7] consists of 6 voices with 20
sentences and the same emotions as in [6]. In [7], the recognition score on Vietnamese using
Im-SFLA (Improved Shuffled Frog Leaping Algorithm) SVM reaches 96.5% for neutrality and drops
to 84.1% for surprise.
© 2017 Viet Nam Academy of Science & Technology
For speech emotion recognition, one can use models such as HMM (Hidden Markov Model), GMM
[1, 6, 19], SVM [1, 7], ANN (Artificial Neural Network), KNN (K-Nearest Neighbors) and some
other classifiers [1]. No single classifier is the most suitable for emotion recognition,
because each classifier has its own advantages and disadvantages. Nevertheless, GMM is
appropriate for emotion recognition because it models the overall envelope of the information
rather than its detailed content. As will be seen in Section 3, two of the three sets of
parameters that determine a GMM, the mean vectors and the covariance matrices, are directly
related to averages. According to [7, 18], the GMM is a popular and promising model for speech
emotion recognition. In this paper, experiments with GMM were performed on different corpora
and with different parameter sets.
The paper consists of five sections. Section 2 describes the characteristic parameters and the
corpora used for the experiments. Section 3 gives an overview of the GMM. The experiment results
using the GMM with the specific parameters for Vietnamese emotion recognition are given in
Section 4. Finally, Section 5 concludes the paper.
2. CORPORA AND CHARACTERISTIC PARAMETERS FOR
EXPERIMENTS ON EMOTION RECOGNITION
2.1. Corpora for experiments
The corpus of emotional Vietnamese used for the experiments in this paper consists of 5584
files from the corpus BKEmo [8], covering four emotions (neutral, sadness, anger and happiness)
spoken by 8 male and 8 female voices. The authors of this paper listened to the recordings and
removed the files that contained errors or did not express the emotion well, which left 5584
files covering 22 different sentences. These include short, long and exclamatory sentences such
as “Got a salary” and “Oh, that person can not change that”, so that the characteristic
parameters of the emotions can be analyzed. Each sentence is pronounced four times. There are
2792 wave files for the male voices and 2792 for the female voices, with 698 files per emotion
in each group. The artists who recorded the emotional speech are 20 to 58 years old. The speech
was recorded in a dubbing studio with a sampling frequency of 16 kHz and 16 bits per sample.
Table 1 shows how this corpus is used in the four experiment cases.
For the corpus Test1, the training and testing sets have the same speakers and the same
content; the sentences with the same content are pronounced at different times. For the corpus
Test2, the training and testing sets have the same speakers, but the 22 sentences are divided
into two parts: 11 sentences are used for training and the remaining 11 for testing. For the
corpus Test3, the speakers are divided into two parts, one for training and one for testing.
For the corpus Test4, both the sentences and the speakers are divided into two parts.
Table 1. Vietnamese emotional corpus for experiments with GMM model

| Corpus | Experiment | Corpus total number of files | Number of training files | Number of testing files |
|--------|------------|------------------------------|--------------------------|-------------------------|
| Test1 | Speaker-dependent and content-dependent | 5584 | 2792 | 2792 |
| Test2 | Speaker-dependent, content-independent | 5584 | 2793 | 2791 |
| Test3 | Speaker-independent, content-dependent | 5584 | 2794 | 2790 |
| Test4 | Speaker-independent, content-independent | 2803 | 1403 | 1400 |
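The four designs can be illustrated with a small sketch. It is not the authors' procedure: the toy corpus, the names `split`, `TRAIN_SPEAKERS` and `TRAIN_SENTENCES`, and the rule that the first half of the speakers or sentences goes to training are all assumptions made for illustration. The paper only states which factor (speaker, content or both) is divided.

```python
from itertools import product

# Hypothetical toy corpus: 4 speakers x 4 sentences x 4 repetitions.
SPEAKERS = ["spk1", "spk2", "spk3", "spk4"]
SENTENCES = ["sen1", "sen2", "sen3", "sen4"]
FILES = list(product(SPEAKERS, SENTENCES, range(4)))

# Assumed rule: the first half of the speakers/sentences goes to training.
TRAIN_SPEAKERS = set(SPEAKERS[:2])
TRAIN_SENTENCES = set(SENTENCES[:2])


def split(files, speaker_independent, content_independent):
    """Return (train, test) for one of the four designs of Table 1."""
    train, test = [], []
    for spk, sen, rep in files:
        if not speaker_independent and not content_independent:
            # Test1: same speakers and content, different repetitions.
            (train if rep % 2 == 0 else test).append((spk, sen, rep))
            continue
        spk_train = spk in TRAIN_SPEAKERS
        sen_train = sen in TRAIN_SENTENCES
        # A file can be used for training if every separated factor lies on
        # the training side, and for testing if every one lies on the test side.
        ok_train = (spk_train or not speaker_independent) and \
                   (sen_train or not content_independent)
        ok_test = (not spk_train or not speaker_independent) and \
                  (not sen_train or not content_independent)
        if ok_train:
            train.append((spk, sen, rep))
        elif ok_test:
            test.append((spk, sen, rep))
        # Otherwise (Test4 only) the file mixes both sides and is left out.
    return train, test


DESIGNS = {"Test1": (False, False), "Test2": (False, True),
           "Test3": (True, False), "Test4": (True, True)}
for name, flags in DESIGNS.items():
    train, test = split(FILES, *flags)
    print(name, len(train), len(test))
```

In this toy setting the Test4 design keeps only half of the files, because files that pair a training speaker with a testing sentence (or the reverse) fit neither side. This mirrors the roughly halved total reported for Test4 in Table 1.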
2.2. Characteristic parameters
The experiments use 87 characteristic parameters, listed in Table 2. These parameters were
extracted from the speech signals in the corpus using the Praat¹ and Alize toolkits [9].
Formants and their corresponding bandwidths are determined by Praat using LPC. The fundamental
frequency F0 is calculated by Praat using cross-correlation analysis. The range for determining
F0 depends on gender: the maximum F0 value is 350 Hz for female voices and 200 Hz for male
voices.
In Table 2, harmonicity, as defined in Praat, represents the degree of acoustic periodicity,
also known as the Harmonics-to-Noise Ratio (HNR), and can be used as a measure of voice
quality. If $S(f)$ is the complex spectrum, where $f$ is the frequency, the centre of gravity
$f_c$ is given by the formula

$$ f_c = \frac{\int_0^\infty f\,|S(f)|^p\,df}{\int_0^\infty |S(f)|^p\,df}, \tag{1} $$

where $\int_0^\infty |S(f)|^p\,df$ is the energy. The centre of gravity is thus the average of
the frequency $f$ over the entire frequency domain, weighted by $|S(f)|^p$. For $p = 2$ the
weighting is done by the power spectrum, and for $p = 1$ by the absolute spectrum; $p = 2/3$ is
also a commonly used value. If $S(f)$ is a complex spectrum, the $n$th central moment is given
by (2), where $f_c$ is the spectral centre of gravity:

$$ \frac{\int_0^\infty (f - f_c)^n\,|S(f)|^p\,df}{\int_0^\infty |S(f)|^p\,df}. \tag{2} $$
The $n$th central moment is the average of $(f - f_c)^n$ over the entire frequency domain,
weighted by $|S(f)|^p$; $n$ is the order of the moment in formula (2). For $n = 2$ we obtain
the variance of the frequencies in the spectrum. The frequency standard deviation is the square
root of this variance.
¹ www.praat.org
For n = 3 we obtain the third-order central moment, which is the non-normalized skewness of
the spectrum. Skewness measures how much the shape of the distribution below the mean differs
from its shape above the mean, that is, how asymmetric the distribution is compared with a
normal distribution. The higher the absolute value of the skewness, the more unbalanced the
distribution. A symmetric distribution has a skewness of 0.
For n = 4 we obtain the non-normalized kurtosis of the spectrum. To normalize it, divide by
the square of the second central moment and subtract 3. Kurtosis is an index of the shape of a
probability distribution: it compares the central portion of the distribution with that of a
normal distribution. The higher and sharper the peak of the distribution, the greater its
kurtosis. Before the subtraction, the kurtosis of the normal distribution is equal to 3, so the
normalized kurtosis of a normal distribution is 0.
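The centre of gravity (1) and the central moments (2) can be approximated on a sampled spectrum by replacing the integrals with sums over frequency bins. The sketch below is an illustration and not Praat's implementation. It assumes uniformly spaced bins (so the bin width cancels in the ratios), and it normalizes the skewness by the 1.5 power of the second central moment, which is the usual convention but is not stated in the text above. The function name `spectral_moments` is hypothetical.

```python
def spectral_moments(freqs, mags, p=2.0):
    """Discrete approximation of formulas (1) and (2).

    freqs: bin frequencies in Hz (assumed uniformly spaced)
    mags:  spectrum magnitudes |S(f)| at those frequencies
    p:     weighting exponent (2 = power spectrum, 1 = absolute spectrum)
    """
    weights = [m ** p for m in mags]
    total = sum(weights)  # discrete counterpart of the energy term

    # Formula (1): weighted average of the frequency.
    fc = sum(f * w for f, w in zip(freqs, weights)) / total

    def central(n):
        # Formula (2): weighted average of (f - fc)^n.
        return sum((f - fc) ** n * w for f, w in zip(freqs, weights)) / total

    mu2, mu3, mu4 = central(2), central(3), central(4)
    return {
        "centre_of_gravity": fc,
        "std_dev": mu2 ** 0.5,
        "skewness": mu3 / mu2 ** 1.5,       # assumed normalization
        "kurtosis": mu4 / mu2 ** 2 - 3.0,   # normalization described above
    }


# A spectrum that is symmetric around 200 Hz.
moments = spectral_moments([100.0, 200.0, 300.0], [1.0, 2.0, 1.0], p=2.0)
print(moments["centre_of_gravity"], moments["skewness"])
```

For the symmetric toy spectrum the centre of gravity falls on the middle bin and the skewness vanishes, in line with the statement that a symmetric distribution has a skewness of 0.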
Table 2. The characteristic parameters for the Vietnamese emotional corpus

| Index | Characteristic parameters | Number of parameters |
|-------|---------------------------|----------------------|
| (1) | MFCC | 19 |
| (2) | The 1st-order derivatives of MFCC | 19 |
| (3) | The 2nd-order derivatives of MFCC | 19 |
| (4) | Energy, the 1st-order and the 2nd-order derivatives of energy | 3 |
| (5) | Fundamental frequency F0 | 1 |
| (6) | Speech intensity | 1 |
| (7) | Formants and their corresponding bandwidths | 8 |
| (8) | Harmonicity | 1 |
| (9) | Centre of gravity | 1 |
| (10) | Central moment | 1 |
| (11) | Skewness | 1 |
| (12) | Kurtosis | 1 |
| (13) | Frequency standard deviation | 1 |
| (14) | LTAS (Long Term Average Spectrum) mean | 1 |
| (15) | Slope and standard deviation of LTAS | 2 |
| (16) | difF0(t) | 1 |
| (17) | F0NormAver(t) | 1 |
| (18) | F0NormMinMax(t) | 1 |
| (19) | F0NormAverStd(t) | 1 |
| (20) | difLogF0(t) | 1 |
| (21) | LogF0NormMinMax(t) | 1 |
| (22) | LogF0NormAver(t) | 1 |
| (23) | LogF0NormAverStd(t) | 1 |
The average value of the spectrum is related to the standard deviation of the spectrum. In a
classification problem, a set of data values that tends to lie near the average is more
concentrated than a set that tends to lie far from the average. Thus, the average is useful for
describing a set of correlated data values. The average of the values $x_1, \ldots, x_N$ is

$$ \bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j. \tag{3} $$
The variants of F0 shown in Table 2 are as follows.
Derivative of F0 (difF0(t)):

$$ \mathrm{difF0}(t) = \frac{d\,\mathrm{F0}(t)}{dt}. \tag{4} $$

Normalization of F0 by the average value $\overline{\mathrm{F0}(t)}$ for each file (F0NormAver(t)):

$$ \mathrm{F0NormAver}(t) = \frac{\mathrm{F0}(t)}{\overline{\mathrm{F0}(t)}}. \tag{5} $$

Normalization of F0 by the min value minF0(t) and the max value maxF0(t) for each file
(F0NormMinMax(t)):

$$ \mathrm{F0NormMinMax}(t) = \frac{\mathrm{F0}(t) - \min \mathrm{F0}(t)}{\max \mathrm{F0}(t) - \min \mathrm{F0}(t)}. \tag{6} $$

Normalization of F0 by the average and the standard deviation of F0 (F0NormAverStd(t)):

$$ \mathrm{F0NormAverStd}(t) = \frac{\mathrm{F0}(t) - \overline{\mathrm{F0}(t)}}{\sigma_{\mathrm{F0}(t)}}. \tag{7} $$

Derivative of LogF0(t) (difLogF0(t)):

$$ \mathrm{difLogF0}(t) = \frac{d\,\mathrm{LogF0}(t)}{dt}. \tag{8} $$

Normalization of LogF0(t) by the min value minLogF0(t) and the max value maxLogF0(t) for each
file (LogF0NormMinMax(t)):

$$ \mathrm{LogF0NormMinMax}(t) = \frac{\mathrm{LogF0}(t) - \min \mathrm{LogF0}(t)}{\max \mathrm{LogF0}(t) - \min \mathrm{LogF0}(t)}. \tag{9} $$

Normalization of LogF0(t) by the average of LogF0(t) for each file (LogF0NormAver(t)):

$$ \mathrm{LogF0NormAver}(t) = \frac{\mathrm{LogF0}(t)}{\overline{\mathrm{LogF0}(t)}}. \tag{10} $$

Normalization of LogF0(t) by the average and the standard deviation of LogF0(t) for each file
(LogF0NormAverStd(t)):

$$ \mathrm{LogF0NormAverStd}(t) = \frac{\mathrm{LogF0}(t) - \overline{\mathrm{LogF0}(t)}}{\sigma_{\mathrm{LogF0}(t)}}. \tag{11} $$
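The definitions (4) to (11) can be sketched for a single file as follows. This is only an illustration under stated assumptions, not the authors' MatLab code: the function name `f0_variants`, the 10 ms frame step, the use of the natural logarithm, the population standard deviation, and the backward finite difference used for the derivative are all assumptions. The input is assumed to hold voiced frames only (F0 > 0) and a non-constant contour.

```python
import math
from statistics import mean, pstdev


def f0_variants(f0, frame_step=0.01):
    """Compute the eight F0 variants for one file.

    f0:         F0 values in Hz for the voiced frames of one file
    frame_step: time between frames in seconds (assumed value)
    """
    def variants(x):
        avg, sd, lo, hi = mean(x), pstdev(x), min(x), max(x)
        # Finite difference approximating d/dt; the first frame gets 0.0.
        dif = [0.0] + [(b - a) / frame_step for a, b in zip(x, x[1:])]
        return {
            "dif": dif,                                         # (4), (8)
            "norm_aver": [v / avg for v in x],                  # (5), (10)
            "norm_min_max": [(v - lo) / (hi - lo) for v in x],  # (6), (9)
            "norm_aver_std": [(v - avg) / sd for v in x],       # (7), (11)
        }

    log_f0 = [math.log(v) for v in f0]  # natural log is an assumption
    return {"F0": variants(f0), "LogF0": variants(log_f0)}


out = f0_variants([100.0, 150.0, 200.0])
print(out["F0"]["norm_min_max"])  # [0.0, 0.5, 1.0]
```

Each variant is a per-frame sequence of the same length as the F0 contour, so it can be appended to the other frame-level parameters as one extra column.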
The characteristic parameters in Table 2 are divided into six sets for the experiments, as
shown in Table 3. The choice of the parameters in Table 2 and their grouping into the sets of
Table 3 for emotion recognition of Vietnamese can be explained as follows. Since MFCC were
proposed [16, 17], they have been commonly used in speech recognition, speaker recognition and
speech emotion recognition systems [1]. The MFCC are therefore considered the basic
characteristic parameters of these systems: they represent the speech signal spectrum in a
condensed form based on the sensitivity of the auditory system. The characteristic parameters
from (8) to (15) are also related to the speech signal spectrum and are determined
statistically. In particular, the characteristic parameters (11) and (12) are closely related
to the normal distribution on which the GMM is based. Vietnamese is a tonal language, and the
variation rules of F0 determine its tones. The F0 variation rules of a word or a sentence also
contribute to the expression of emotion [8]. Therefore, F0 and the parameters from (16) to (23)
are very closely related to Vietnamese and to the emotions of the voice.
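Table 3. Establishment of characteristic parameter sets used for experiments

| Set of parameters | Name | Characteristic parameters for the indexes in Table 2 | Number of parameters |
|-------------------|------|------------------------------------------------------|----------------------|
| 1 | MFCC | (1) | 19 |
| 2 | MFCC+Delta1 | (1), (2) | 38 |
| 3 | MFCC+Delta12 | From (1) to (3) | 57 |
| 4 | prm60 | From (1) to (4) | 60 |
| 5 | prm79 | From (1) to (15) | 79 |
| 6 | prm87 | From (1) to (23) | 87 |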
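Because every set in Table 3 is a prefix of the index list in Table 2, the set sizes can be checked, and a set can be selected from a full 87-dimensional frame vector, with a few lines of code. The names `COUNTS`, `SETS`, `set_dimension` and `select` are hypothetical, and the sketch assumes that the columns of a frame vector are stored in the order of the Table 2 indexes.

```python
# Number of parameters for the indexes (1) to (23), taken from Table 2.
COUNTS = [19, 19, 19, 3, 1, 1, 8, 1, 1, 1, 1, 1, 1, 1, 2] + [1] * 8

# Last Table 2 index included in each parameter set, taken from Table 3.
SETS = {
    "MFCC": 1,
    "MFCC+Delta1": 2,
    "MFCC+Delta12": 3,
    "prm60": 4,
    "prm79": 15,
    "prm87": 23,
}


def set_dimension(name):
    """Number of parameters in a set: sum of the counts up to its last index."""
    return sum(COUNTS[:SETS[name]])


def select(frame, name):
    """Keep the leading columns of a full frame vector that belong to the set.

    Assumes the columns are ordered by Table 2 index.
    """
    return frame[:set_dimension(name)]


for name in SETS:
    print(name, set_dimension(name))
```

The printed sizes are 19, 38, 57, 60, 79 and 87, which agree with the last column of Table 3.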
3. VIETNAMESE EMOTIONAL RECOGNITION USING GMM
From the statistical point of view of pattern recognition, each class is modeled by a
probability distribution estimated from the available training data. Statistical classifiers
such as HMM and GMM have been used in many speech recognition applications. The GMM is a
probabilistic model for density estimation that uses a convex combination of multivariate
normal distributions. A GMM can be considered a special continuous HMM with only one state
[12]. GMM is very effective for modeling multimodal distributions, and its training
requirements are far lower than those of a general continuous HMM. Thus, GMM is preferable to
HMM for speech emotion recognition because only the general features are extracted from the
speech used for training. However, a GMM cannot model the temporal structure of the training
data, because the equations for training and recognition assume that all feature vectors are
independent. GMM has been used quite commonly for speaker identification [9], language
identification [10], dialect recognition [11] and classification of music genres [13]. For
emotion recognition, each emotion is modeled by one GMM, and its set of parameters is
determined by training on the set of learning patterns.
Suppose that an utterance of emotion $j$ consists of $K$ speech frames and that a feature
vector $x_i$ of dimension $D$ is extracted from each frame. The utterance then corresponds to
the set $X$ of $K$ feature vectors, $X = \{x_1, x_2, \ldots, x_K\}$. Assume that the feature
vectors follow a Gaussian distribution, which is determined by the mean and the deviation from
the mean. The distribution of the features of emotion $j$ can then be modeled by a mixture of
Gaussian distributions. The Gaussian mixture model $\lambda_j$ of emotion $j$ is the weighted
sum of $M$ component distributions, given by the probability

$$ p(X|\lambda_j) = \sum_{m=1}^{M} g_m N(X; \mu_m, \Sigma_m). \tag{12} $$

In (12), $g_m$ are the mixture weights, which satisfy the condition $\sum_{m=1}^{M} g_m = 1$,
and $N(X; \mu_m, \Sigma_m)$ are the component density functions, which are multivariate
Gaussian distributions of the form

$$ N(X; \mu_m, \Sigma_m) = \frac{1}{(2\pi)^{D/2} |\Sigma_m|^{1/2}} \exp\left( -\frac{1}{2} (X - \mu_m)^T \Sigma_m^{-1} (X - \mu_m) \right). \tag{13} $$

In (13), $\mu_m \in \mathbb{R}^D$ is the mean vector and $\Sigma_m \in \mathbb{R}^{D \times D}$
is the covariance matrix. Thus, the GMM $\lambda_j$ for emotion $j$ is defined by the triple of
mean vectors, covariance matrices and weights of the $M$ components:
$\lambda_j = \{\mu_m, \Sigma_m, g_m\}_j$, $m = 1, 2, \ldots, M$. In practice, $\lambda_j$ is
determined with the expectation-maximization algorithm, which finds the parameters that
maximize the log-likelihood $\log p(X|\lambda_j)$ [14]. In this paper, the Alize toolkit [19]
was used to estimate the models $\lambda_j$ and to perform the emotion recognition experiments.
MatLab served as the intermediate programming language that connects and coordinates the Alize
and Praat toolkits, performs the computations and sets up the appropriate configuration files,
so that the emotion recognition of Vietnamese in our research was performed completely
automatically.
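The scoring side of this scheme can be sketched in a few lines. The sketch is an illustration and not the Alize implementation: it assumes the usual maximum-likelihood decision rule (one GMM per emotion, choose the emotion whose model gives the highest total log-likelihood over the frames), it restricts the covariance matrices to diagonal ones for brevity, and it leaves out training by expectation-maximization. The function names and the toy model parameters are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log of a multivariate Gaussian density with diagonal covariance."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)


def log_gmm(x, gmm):
    """Log of the weighted sum of component densities for one frame.

    gmm is a list of (weight, mean vector, variance vector) triples.
    The log-sum-exp trick avoids underflow for unlikely frames.
    """
    terms = [math.log(g) + log_gauss_diag(x, mu, var) for g, mu, var in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def classify(frames, models):
    """Return the best emotion and all scores for a list of frame vectors.

    Frames are treated as independent, so their log-likelihoods add up.
    """
    scores = {emotion: sum(log_gmm(x, gmm) for x in frames)
              for emotion, gmm in models.items()}
    return max(scores, key=scores.get), scores


# Toy two-dimensional models with M = 2 components each (made-up values).
models = {
    "neutral": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "anger": [(0.7, [5.0, 5.0], [1.0, 1.0]), (0.3, [6.0, 4.0], [2.0, 2.0])],
}
frames = [[4.8, 5.1], [5.5, 4.6], [6.1, 4.2]]
label, scores = classify(frames, models)
print(label)  # anger
```

Summing per-frame log-likelihoods is the direct consequence of the independence assumption mentioned above, which is also why a GMM scored this way ignores the order of the frames.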
4. EXPERIMENT RESULTS
This section presents the recognition experiments based on the GMM for the four basic emotions
of Vietnamese (neutral, sadness, anger and happiness) on the four corpora in Table 1. Each
experiment was conducted with the six sets of parameters in Table 3 and with the number of
Gauss components M increasing from 16 to 8192 in powers of 2 ($M = 2^n$, n = 4, 5, ..., 13).
The four recognition experiments below were performed and evaluated in the same way.
4.1. Experiment 1: Speaker-dependent and content-dependent corpus
Figure 1 shows the emotion recognition results for the six sets of parameters. In general, the
recognition score increases as M increases. With the prm87 set, the average recognition score
was 98.96%, the highest of all parameter sets, and the score ranged from 98.96% to 99.97%.
With MFCC only, the recognition score was the lowest: it ranged from 72.96% to 88.90%, with a
mean of 82.96%. The other four sets of parameters have approximately equal recognition scores,
ranging from 87.43% to 89.19%. When M > 128, the parameter set MFCC+Delta1 gives higher
recognition scores than the parameter sets MFCC+Delta12, prm60 and prm79.
Figure 1. Experiment results with the corpus Test1 (recognition score (%) versus number of Gauss components M, from 16 to 8192, for the parameter sets MFCC, MFCC+Delta1, MFCC+Delta12, prm60, prm79 and prm87)
Figure 2 shows the mean recognition score of each emotion for each set of parameters, averaged
over all values of M. The average recognition score was lowest for the sad emotion (83.69%),
followed by the happy emotion (86.57%). The other two emotions had higher and approximately
equal average recognition scores: 89.06% for the angry emotion and 89.08% for the neutral
emotion. All four emotions had their highest recognition scores with the prm87 parameter set,
with mean recognition scores of 99.66%, 98.77%, 97.7% and 90.64% for the neutral, happy, angry
and sad emotions, respectively.
In Experiment 1, the confusion among the emotions was lowest with the parameter set prm87 and
M = 4096. The confusion recognition scores (%) for this case are shown in Table 4, where the
recognition scores are the highest and the wrong recognition scores are the smallest. The
average recognition score of the four emotions is 99.965%: it is 100% for the happy, sad and
neutral emotions and 99.86% for the angry emotion. The confusion score from the angry emotion
to the happy emotion is 3.15%. For all other pairs of emotions, the confusion recognition
scores are less than or equal to 1%.
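Table 4. Confusion recognition scores (%) among emotions using Test1

| M=4096 | Happy | Sad | Angry | Neutral |
|--------|-------|-----|-------|---------|
| Happy | 100 | 0 | 1 | 0 |
| Sad | 0 | 100 | 0 | 0.72 |
| Angry | 3.15 | 0.57 | 99.86 | 0 |
| Neutral | 0 | 0.43 | 0 | 100 |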
Figure 2. Average of recognition scores for four emotions (Happy, Sad, Angry, Neutral) with characteristic parameter sets in Experiment 1 (vertical axis: recognition score (%))
4.2. Experiment 2: Speaker-dependent and content-independent corpus
The averages of the recognition scores of the four emotions for each set of parameters are
shown in Figure 3. With the prm87 parameter set, the recognition score remains the highest of
all parameter sets and ranges from 93% to 99.11%. With the remaining five sets of parameters,
the recognition scores range from 72.29% to 85.71%. The MFCC parameter set again gives the
lowest recognition score. The MFCC+Delta12 and prm60 sets give almost equal recognition scores.
Figure 3. Experiment results with the corpus Test2 (recognition score (%) versus number of Gauss components M, from 16 to 8192, for the parameter sets MFCC, MFCC+Delta1, MFCC+Delta12, prm60, prm79 and prm87)
Figure 4. Average of recognition scores for four emotions (Happy, Sad, Angry, Neutral) with characteristic parameter sets in Experiment 2 (vertical axis: recognition score (%))
Figure 4 shows that the recognition score of the happy emotion is lower than that of the other
three emotions with the MFCC+Delta1, MFCC+Delta12, prm60 and prm79 parameter sets. However,
the average recognition score of this emotion increases more strongly than that of the other
three emotions when the parameter set prm87 is used. With the prm87 parameter set, the sad
emotion has the lowest recognition score of the four emotions. The recognition scores of the
angry and neutral emotions increase with the parameter sets prm79 and prm87.
In Experiment 2, the confusion scores among the emotions are the lowest with the parameter set
prm87 and M = 128. These confusion scores are summarized in Table 5. The highest recognition
score is 97.98% for the happy emotion, and the lowest is 85.09% for the angry emotion. The
confusion score from the sad to the neutral emotion is the highest, at 3.01%. The remaining
confusion scores are below 1%. On average, the recognition score of the four emotions is 93%
and the confusion score is 0.418%.
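Table 5. Confusion recognition scores (%) among emotions using Test2

| M=128 | Happy | Sad | Angry | Neutral |
|-------|-------|-----|-------|---------|
| Happy | 97.98 | 0 | 0.29 | 0 |
| Sad | 0 | 93.83 | 0 | 3.01 |
| Angry | 0 | 0.85 | 85.09 | 0 |
| Neutral | 0 | 0.86 | 0 | 95.11 |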
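The two averages quoted for Experiment 2 can be recomputed from Table 5. The sketch below assumes that the average recognition score is the mean of the four diagonal cells and that the average confusion score is the mean of the twelve off-diagonal cells. The text does not define these averages explicitly, so this is an interpretation, although it reproduces the reported values. The names `TABLE5` and `averages` are hypothetical.

```python
# Rows and columns in the order Happy, Sad, Angry, Neutral (values from Table 5).
TABLE5 = [
    [97.98, 0.0, 0.29, 0.0],
    [0.0, 93.83, 0.0, 3.01],
    [0.0, 0.85, 85.09, 0.0],
    [0.0, 0.86, 0.0, 95.11],
]


def averages(table):
    """Return (mean of diagonal cells, mean of off-diagonal cells)."""
    n = len(table)
    diagonal = [table[i][i] for i in range(n)]
    off_diagonal = [table[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diagonal) / n, sum(off_diagonal) / len(off_diagonal)


avg_recognition, avg_confusion = averages(TABLE5)
print(f"{avg_recognition:.2f} {avg_confusion:.2f}")  # 93.00 0.42
```

Under this interpretation the results are about 93.00% and 0.42%, which agree with the reported 93% and 0.418% up to rounding.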
4.3. Experiment 3: Speaker-independent and content-dependent corpus
Figure 5 shows the recognition results with the Test3 corpus. The parameter set prm87 still
gives the highest recognition score, with an average of 85.44%. Notably, in this experiment the
highest score is 90.14% with M = 16 and the lowest is 80.54% with M = 256.
Figure 5. Experiment results with the corpus Test3 (recognition score (%) versus number of Gauss components M, from 16 to 8192, for the parameter sets MFCC, MFCC+Delta1, MFCC+Delta12, prm60, prm79 and prm87)
Figure 6. Average of recognition scores for four emotions (Happy, Sad, Angry, Neutral) with characteristic parameter sets in Experiment 3 (vertical axis: recognition score (%))
For the remaining parameter sets, the recognition scores range from 57.75% to 74.79%. As M
increases, the recognition scores for these parameter sets increase only slightly, from 66.97%
to 67.37%.
Figure 6 shows that the recognition score of the angry emotion is higher than that of the
other three emotions for each of the parameter sets. The average recognition scores of the
happy, angry and neutral emotions increase when the parameter set prm87 is used, whereas the
recognition score of the sad emotion declines with this set. If only MFCC are used, the neutral
emotion has the lowest recognition score.
The confusion scores (%) are shown in Table 6 (M = 256). The confusion score from the neutral
to the sad emotion is the highest, at 23.42%. The average recognition score of the four
emotions in Experiment 3 is 80.54%, and the average confusion score is 2.7%.
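Table 6. Confusion recognition scores (%) among emotions using Test3

| M=256 | Happy | Sad | Angry | Neutral |
|-------|-------|-----|-------|---------|
| Happy | 89.6 | 0 | 0.28 | 0 |
| Sad | 0 | 63.71 | 3 | 0.29 |
| Angry | 5.49 | 0 | 79.19 | 0 |
| Neutral | 0 | 23.42 | 0 | 89.66 |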
4.4. Experiment 4: Speaker-independent and content-independent corpus
In Experiment 4, the recognition score for the prm87 parameter set is significantly higher
than for the remaining parameter sets. The score is highest at M = 1024 (94.22%), and the
average recognition score is 90.76%. The remaining sets of parameters have lower recognition
scores, ranging from 52.69% to 69.40%.
Figure 8 shows the recognition scores for each emotion. The recognition score of the angry
emotion is the best for all parameter sets, followed in decreasing order by the sad, happy and
neutral emotions. In general, the recognition scores of the emotions vary little across the
MFCC+Delta1, MFCC+Delta12, prm60 and prm79 parameter sets: happy (52.60% to 62.52%), sad
(58.96% to 67.61%), angry (74.21% to 87.44%), neutral (40.55% to 45.09%).
With the prm87 parameter set, however, the recognition scores increase to 97.17% (happy),
98.15% (angry) and 97.08% (neutral). The exception is the sad emotion, whose score drops to
64.33%, below the scores of the other three emotions.
The confusion among the emotions is lowest with the parameter set prm87 and M = 16. The
confusion scores are summarized in Table 7. The highest recognition scores in Table 7 are
97.44% for the angry emotion and 97.43% for the happy emotion, and the lowest is 48.01% for the
sad emotion. The confusion score from the neutral to the sad emotion is the highest, at 25.43%,
and the confusion score from the angry to the happy emotion is only 1.14%. All other pairs of
emotions have a confusion score of 0%. The average recognition score of the four emotions is
84.42% and the average confusion score is 2.21%.
Figure 7. Experiment results with the corpus Test4 (recognition score (%) versus number of Gauss components M, from 16 to 8192, for the parameter sets MFCC, MFCC+Delta1, MFCC+Delta12, prm60, prm79 and prm87)
Figure 8. Average of recognition scores for four emotions (Happy, Sad, Angry, Neutral) with characteristic parameter sets in Experiment 4 (vertical axis: recognition score (%))
Table 7. Confusion recognition scores (%) among emotions using Test4

| M=16 | Happy | Sad | Angry | Neutral |
|------|-------|-----|-------|---------|
| Happy | 97.43 | 0 | 0 | 0 |
| Sad | 0 | 48.01 | 0 | 0 |
| Angry | 1.14 | 0 | 97.44 | 0 |
| Neutral | 0 | 25.43 | 0 | 94.8 |

4.5. Comparison of experiment results
The average recognition scores of the four experiments above are shown in Figure 9. The
average recognition score for the emotions is the highest in Experiment 1, at 89.21%, and is
82.27%, 70.35% and 66.99% for Experiments 2, 3 and 4, respectively. This is to be expected: in
Experiment 1 the training and recognition phases have the same speakers and the same content,
and only the moments of pronunciation differ, so the recognition scores are the highest. In
Experiment 4, both the speakers and the content differ between the training and recognition
phases, so the recognition scores are the lowest.
Figure 9. Average of recognition scores for four Experiments (recognition score (%) for each characteristic parameter set; series: Test1, Test2, Test3, Test4)
4.6. Number of Gauss components M
Figure 10 shows the relationship between the number of Gauss components M and the average
recognition scores of the four experiments above. For low values of M (between 16 and 512), the
recognition scores increase significantly. When M increases from 512 to 8192, the average
recognition scores increase very little.
When M is sufficiently large (over 512), the GMM comes close to its best approximation of the
emotion model, and the average recognition score saturates as M increases further. Determining
the optimal number of Gauss components M is important, but it is also a difficult problem [1].
As M increases, the computation time increases as well. Depending on the set of parameters used
for recognition, the value of M should be chosen according to the computation time available
and the desired recognition accuracy.
Figure 10. Relationship between the number of Gauss components M and the average of recognition scores (recognition score (%) versus M, from 16 to 8192; series: Test1_Aver, Test2_Aver, Test3_Aver, Test4_Aver)
4.7. Influence of fundamental frequency on Vietnamese emotional recognition
Our recent study [15] on the individual effects of each spectral feature from (7) to (15)
shows that spectral features (11) and (12) have a greater effect on the recognition score than
the rest, because characteristic parameters (11) and (12) are based on the normal distribution
that the GMM uses. In the following, the effect of each F0 variant on the score of Vietnamese
emotion recognition is presented. As can be seen from Figure 9, adding the parameters directly
related to F0 increases the recognition score significantly more than adding the parameters
directly related to the spectrum. When the F0 variants are added (from prm79 to prm87), the
average recognition score increases most strongly for Test4, by 24.32%. The smallest increase,
10.05%, is for Test1. Even this smallest increase is greater than the largest increase obtained
by adding spectral features (6.29% for Test4). To assess the individual effect of each F0
variant, all parameters other than the F0 variants were kept unchanged.
The number of Gaussian components M = 512 was chosen to conduct the evaluation.
From Section 4.6, this value of M can be considered as lying in the transition from the
fast-rising to the slower-rising range of the recognition score as M increases. Figure 11
shows the average of recognition scores for four emotions when one of the eight F0 variants
was added while the remaining parameters remained unchanged and M = 512. The effect of the
F0 variants is not quite the same for the different Tests. Among the four Tests, with the
addition of F0 variants, Test3 had lower recognition scores than the others. For Test1, F0
variants (18), (19), (20), (22), and (23) gave the highest recognition score, which reached
100%. Similar to Test1, Test3 and Test4 had their highest recognition scores when F0 variant
(23) was added, at 87.42% and 93.46%, respectively. Meanwhile, F0 variant (23) had the least
effect on Test2, unlike the other three Tests. When F0 variant (16) was added, Test2 had its
highest recognition score, 96.56%. For Test1, Test2, Test3, and Test4, the F0 variants with
the least effect are (17), (23), (20), and (18), respectively. The significant increase in
Vietnamese emotion recognition scores when supplementing the F0 variants is perfectly
reasonable, because F0 plays a very important role in a tonal language such as Vietnamese
and also contributes actively to emotional expression.
[Figure 11 plot omitted. Title: M = 512; horizontal axis: F0 variants and Tests from 1 to 4
(Test1, Test2, Test3, Test4); vertical axis ticks: 82 to 100; legend: F0 variants (16) to (23).]
Figure 11. Average recognition scores for four emotions as a function of the added F0 variant
(with the other 79 parameters) for Tests 1 to 4, with M = 512
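The one-variant-at-a-time evaluation described above can be sketched as follows. This is a
minimal illustration, not the authors' implementation: it assumes that each frame stores the
79 base parameters first, followed by the eight F0 variants in the order (16) to (23), and
that each emotion is modelled by a diagonal-covariance GMM. The function names, the frame
layout, and the model format (weights, means, variances) are all hypothetical.

```python
import math

N_BASE = 79                        # parameters kept in every run (from the text)
F0_VARIANTS = list(range(16, 24))  # labels (16)..(23) of the eight F0 variants


def ablation_vector(frame, variant_label):
    """Keep the 79 base parameters and append exactly one F0-variant value.

    Assumes (hypothetically) that the frame holds the base parameters first,
    then the F0 variants in the order (16)..(23).
    """
    offset = N_BASE + F0_VARIANTS.index(variant_label)
    return frame[:N_BASE] + [frame[offset]]


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one vector under a diagonal-covariance GMM."""
    comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        comp.append(ll)
    top = max(comp)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(c - top) for c in comp))


def classify(frames, models):
    """Return the emotion whose GMM gives the highest total log-likelihood."""
    totals = {
        emotion: sum(gmm_loglik(x, *params) for x in frames)
        for emotion, params in models.items()
    }
    return max(totals, key=totals.get)
```

Repeating training and classification once per F0 variant, with `ablation_vector` selecting
the 80 values used in that run, would give one recognition score per variant and per Test,
which is the kind of comparison summarized in Figure 11.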
5. CONCLUSION
The paper presents the results of recognition experiments for four basic emotions of
Vietnamese, namely happiness, sadness, neutral, and anger, with four different corpora
depending on whether the speaker and the content are dependent or independent. These
experiments were also conducted with six different parameter sets based on the GMM model.
The results show that the recognition scores are the highest when the speaker-dependent and
content-dependent corpus is used. The recognition score is the lowest for the
speaker-independent and content-independent corpus. With the speaker-dependent but
content-independent corpus or the speaker-independent but content-dependent corpus, the
recognition scores lie between these highest and lowest cases. In all four experiments,
the prm87 parameter set always gave the highest recognition scores. In general, information
on fundamental frequency significantly increased the Vietnamese emotion recognition score.
REFERENCES
[1] Moataz El Ayadi, Mohamed S. Kamel, Fakhri Karray, “Survey on speech emotion recognition:
Features, classification schemes, and databases”, Pattern Recognition, vol.44, pp. 572–587, 2011.
[2] Do Tien Thang, “Primary examination of Vietnamese intonation”, Ha Noi National University
Publishing House, 2009.
[3] Dang-Khoa Mac, Eric Castelli, Véronique Aubergé, “Modeling the prosody of Vietnamese
attitudes for expressive speech synthesis”, Workshop of Spoken Languages Technologies for
Under-resourced Languages (SLTU 2012), Cape Town, South Africa, May 7-9, 2012.
[4] Dang-Khoa Mac, Do-Dat Tran, “Modeling Vietnamese speech prosody: a step-by-step approach
towards an expressive speech synthesis system”, Trends and Applications in Knowledge
Discovery and Data Mining, vol. 9441, Springer, pp. 273–287, 2015.
[5] Viet Hoang Anh, Manh Ngo Van, Bang Ban Ha, Thang Huynh Quyet, “A real-time model based
support vector machine for emotion recognition through EEG”, International Conference on
Control, Automation and Information Sciences (ICCAIS), Ho Chi Minh city, Vietnam, Nov
26-29, 2012.
[6] La Vutuan, Huang Cheng-Wei, Ha Cheng, Zhao Li, “Emotional feature analysis and recognition
from Vietnamese speech”, Journal of Signal Processing, China, vol. 29, no. 10, pp. 1423–1432,
Oct 2013.
[7] Jiang Zhipeng, Huang Chengwei, “High-order Markov random fields and their applications in
cross-language speech recognition”, Cybernetics and Information Technologies, Sofia, vol. 15,
no. 4, pp. 50–57, 2015.
[8] Le Xuan Thanh, Dao Thi Le Thuy, Trinh Van Loan, Nguyen Hong Quang, “Speech emotions
and statistical analysis for Vietnamese emotion corpus”, Journal of Vietnam Ministry of
Information and Communication, no. 15 (35), pp. 86–98, 2016.
[9] Jean-François Bonastre, Frédéric Wils, “Alize, a free toolkit for speaker recognition”, IEEE
International Conference, In ICASSP (1), pp. I 737 – I 740, 2005.
[10] Torres-Carrasquillo, P. A., Singer, E., Kohler, M. A., Greene, R. J., Reynolds, D. A., and Deller
Jr., J. R., “Approaches to language identification using Gaussian mixture models and shifted
delta cepstral features”, In Proc. International Conference on Spoken Language Processing
in Denver, CO, ISCA, pp. 33-36, 82-92, September, 2002.
[11] Bin MA, Donglai ZHU and Rong TONG, “Chinese dialect identification using tone features
based on pitch”, ICASSP, pp. 1029–1032, 2006.
[12] D. Reynolds, C. Rose, “Robust text-independent speaker identification using Gaussian mixture
speaker models”, IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, 1995.
[13] Bağcı U., Erzin E., “Boosting Classifiers for Music Genre Classification”, In: Yolum P., Güngör
T., Gürgen F., Özturan C. (eds) Computer and Information Sciences – ISCIS, Lecture Notes in
Computer Science, vol. 3733, Springer, Berlin, Heidelberg, 2005.
[14] J. Bilmes, “A gentle tutorial of the EM algorithm and its application to parameter estimation for
Gaussian mixture and hidden Markov models”, Technical Report TR-97-021, International
Computer Science Institute (ICSI), Berkeley, CA, pp. 1–13, 1998.
[15] J.-F. Bonastre, F. Wils, and S. Meignier, “Alize, a free toolkit for speaker recognition”, ICASSP
(1), pp. 737 – 740, 2005.
[16] Dao Thi Le Thuy, Trinh Van Loan, Nguyen Hong Quang, Le Xuan Thanh, “Influence of spectral
features of speech signal on emotion recognition of Vietnamese”, Fundamental and Applied IT
Research (FAIR), pp. 36–43, 2017.
[17] P. Mermelstein, “Distance measures for speech recognition, psychological and instrumental,” in
Pattern Recognition and Artificial Intelligence, C. H. Chen, Ed., Academic, New York, pp.
374–388, 1976.
[18] S.B. Davis, and P. Mermelstein, “Comparison of parametric representations for monosyllabic
word recognition in continuously spoken sentences”, in IEEE Transactions on Acoustics,
Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[19] Marcel Kockmann, Lukáš Burget, Jan “Honza” Černocký, “Application of speaker- and
language identification state-of-the-art techniques for emotion recognition”, Speech
Communication, vol. 53, pp. 1172–1185, 2011.
[20] R. Subhashree, G. N. Rathna, “Speech emotion recognition: performance analysis based on
fused algorithms and GMM modelling”, Indian Journal of Science and Technology, vol. 9(11),
doi: 10.17485/ijst/2016/v9i11/88460, March 2016.
Received on December 26, 2017
Revised on February 08, 2018