Improving Bottleneck Features for Vietnamese Large Vocabulary Continuous Speech Recognition System Using Deep Neural Networks

Journal of Computer Science and Cybernetics, V.31, N.4 (2015), 267-276
DOI: 10.15625/1813-9663/31/4/5944

IMPROVING BOTTLENECK FEATURES FOR VIETNAMESE LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION SYSTEM USING DEEP NEURAL NETWORKS

QUOC BAO NGUYEN (1), TAT THANG VU (2), AND CHI MAI LUONG (2)

(1) University of Information and Communication Technology, Thai Nguyen University; nqbao@ictu.edu.vn
(2) Institute of Information Technology, Vietnam Academy of Science and Technology; vtthang@ioit.ac.vn, lcmai@ioit.ac.vn

Abstract. In this paper, a pre-training method based on denoising auto-encoders is investigated and shown to provide good models for initializing the bottleneck networks of a Vietnamese speech recognition system, resulting in better recognition performance than the base bottleneck features reported previously. The experiments are carried out on a dataset containing speech from the Voice of Vietnam (VOV) channel. The results show that deep bottleneck feature (DBNF) extraction for Vietnamese speech recognition decreases the relative word error rate by 14% and 39% compared to the base bottleneck features and the MFCC baseline, respectively.

Keywords. Deep bottleneck features, neural network, Vietnamese speech recognition.

1. INTRODUCTION

In automatic speech recognition systems, feature extraction is an important part of achieving good recognition performance. Previous works [1, 2] have shown that artificial neural networks can be used to extract good, discriminative features that yield better recognition performance than standard feature extraction algorithms such as Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP). One possible approach is to train a network with a small bottleneck layer and then use the activations of the units in this layer to produce feature vectors ("bottleneck features", BNF [1]) for the remaining parts of the system.

Recently, deep learning has gained a lot of attention in the machine learning community. The general objective of this field is the training of large neural networks with many hidden layers, so-called deep neural networks (DNNs). While most frequently used in computer vision, multiple recent works [3-6] have demonstrated the ability of deep networks to achieve superior performance on speech recognition tasks as well.

In this study, deep neural networks are applied to improve the bottleneck features for Vietnamese speech recognition reported previously [7]. It is shown that the pre-training algorithm produces better models than neural networks trained with standard methods, and that deeper networks result in better performance of the resulting Vietnamese speech recognition system. DBNFs are also compared and combined with other state-of-the-art techniques in order to determine the importance of DBNFs in Vietnamese recognition.

2. DEEP LEARNING

In recent years, deep learning has gained a lot of attention in the machine learning community. The general objective of this field is the training of large neural networks with many hidden layers, so-called deep neural networks (DNNs). There are multiple reasons why deep learning is attractive. From a theoretical point of view, deep networks are more efficient than shallow ones in the sense that they are able to represent complex functions with exponentially fewer computational elements [8]. In theory, this makes them more suitable for high-dimensional classification problems with complicated decision boundaries.
Another motivation for deep architectures is the automatic discovery of feature hierarchies, in which high-level features are composed of low-level features. For example, in an image processing task a feature representing a face might be composed of features for eyes, a nose and more, which are in turn represented as combinations of simple edge detectors.

Some deep learning algorithms are also unsupervised, in that they do not use labels when training the network. Since accurately labeling training data is a time-consuming and labor-intensive task, this is a very attractive property. Especially in speech recognition, where recordings first have to be transcribed by humans, systems can benefit from leveraging unlabeled speech data.

2.1. Denoising Auto-encoders

Auto-encoders are a special kind of ANN that is trained not for a classification task but to reconstruct the network input as well as possible after it has been transformed by a hidden layer. The hidden layer of a trained auto-encoder provides a hidden representation (encoding) of the input data. If the number of units in the hidden layer is smaller than the number of input features, the network is forced to learn a compact and invertible encoding of its input [9, 10]. This corresponds to a non-linear dimensionality reduction. Figure 1 shows the architecture of an auto-encoder with a small number of hidden units. The model consists of an input layer, a single hidden layer and an output layer with the same dimensionality as the input layer. Again, the backpropagation algorithm can be used to train the auto-encoder, using the values of the input vector as targets for the output layer.

Figure 1: Auto-encoder architecture.

The auto-encoder architecture can be used for building deep neural networks, which was explored by Bengio et al. as an alternative to restricted Boltzmann machines. The main idea of this approach is to train each additional hidden layer as an auto-encoder for the hidden representation of the previous layer. The resulting stack of auto-encoders can then be transformed into a standard, feed-forward neural network.

Denoising auto-encoders (DAEs) are an alternative approach to extending classic auto-encoders for deep learning purposes, proposed by Vincent et al. [11]: the input data is first corrupted by applying random noise to the individual features, and the model is then trained to reconstruct the uncorrupted input from the corrupted input in an auto-encoder-like fashion. In their work, Vincent et al. showed that hidden representations learned from randomly corrupted input differ from the results achieved with standard sparsity constraints and may provide more useful features when adding further layers.

The individual steps of the computation performed by a denoising auto-encoder are illustrated in Figure 2. The main difference to normal auto-encoders is the corruption of an input vector x, which can be formalized as the application of a stochastic process q_D:

    \tilde{x} \sim q_D(\tilde{x} \mid x)    (1)

Figure 2: Architecture of a denoising auto-encoder with masking noise. The black color indicates that units are set to zero (masked).

This process is performed by adding random noise to individual training examples. Vincent et al. proposed multiple possible noise models [12]. The most common one consists of randomly setting a fraction of the elements of the input vector to zero. Another is Gaussian noise, which replaces every element of the input with a random sample drawn from a normal distribution centered on that element and with a fixed variance. The last one is salt-and-pepper noise, which sets random elements of the input to their minimum or maximum value. While some types of noise can be regarded as more natural choices for a given task, Vincent et al. demonstrated that all types result in learning useful hidden representations.
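As an illustration of the masking-noise corruption process q_D in Eq. (1), the following is a minimal NumPy sketch (not code from the paper); the 20% corruption level and the 360-dimensional input used here match the values reported later in Sections 3.1 and 4.2, while the function name is purely illustrative.

```python
import numpy as np

def mask_corrupt(x, corruption_level=0.2, rng=None):
    """Masking noise q_D: set a random fraction of the input elements to zero.

    Illustrative sketch of the corruption step in Eq. (1).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Each element survives with probability (1 - corruption_level).
    mask = rng.random(x.shape) >= corruption_level
    return x * mask

# Example: corrupt a mini-batch of 360-dimensional input frames (cf. Section 3.1).
x = np.random.randn(128, 360).astype(np.float32)
x_tilde = mask_corrupt(x, corruption_level=0.2)
```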
In this work, masking noise is applied to the data by setting every element of the input vector to zero with a fixed probability. The corrupted input x̃ is first mapped (by an encoder) to the hidden representation y, using the weight matrix W of the hidden layer, the bias vector b of the hidden units and a non-linear activation function σ_y:

    y = \sigma_y(W\tilde{x} + b)    (2)

The latent representation y, or code, is then mapped back by a decoder into a reconstruction z using the transposed weight matrix and the visible bias vector c. In a model with tied weights, the same weight values are used for both encoding and decoding, again through a similar transformation σ_z:

    z = \sigma_z(W^T y + c)    (3)

The parameters of this model (namely W, b and c) are optimized such that the average reconstruction error is minimized. The reconstruction error can be measured by the cross-entropy objective defined in (4) in order to obtain the gradients necessary for adjusting the network weights:

    L_H(x, z) = -\sum_i \left[ x_i \log z_i + (1 - x_i) \log(1 - z_i) \right]    (4)

However, when training a network on speech features like MFCCs, the first layer models real-valued rather than binary data, so the mean squared error L_2(x, z) = \sum_i (x_i - z_i)^2 is selected as the training criterion.
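To make Eqs. (2)-(4) concrete, below is a minimal NumPy sketch of one tied-weight denoising auto-encoder forward pass and its two reconstruction losses. It is an illustration under the assumptions stated in the comments (sigmoid activations, masking noise, inputs scaled to [0, 1] for the cross-entropy case), not the authors' implementation, which used Theano (see Section 4.2).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_in, n_hid = 360, 1024                  # input window and hidden size as in Sections 3.1 and 4.2
W = rng.normal(0.0, 0.01, (n_hid, n_in)) # tied weights: the decoder uses W^T
b = np.zeros(n_hid)                      # hidden bias
c = np.zeros(n_in)                       # visible bias

x = rng.random((128, n_in))                         # a mini-batch of inputs scaled to [0, 1]
x_tilde = x * (rng.random(x.shape) >= 0.2)          # Eq. (1): masking noise, 20% of elements zeroed

y = sigmoid(x_tilde @ W.T + b)                      # Eq. (2): encoder
z = sigmoid(y @ W + c)                              # Eq. (3): decoder with tied weights W^T

# Eq. (4): cross-entropy reconstruction error (binary-like inputs)
L_H = -np.sum(x * np.log(z + 1e-12) + (1 - x) * np.log(1 - z + 1e-12), axis=1)
# Mean squared error, used instead for real-valued features such as MFCCs
L_2 = np.sum((x - z) ** 2, axis=1)
print(L_H.mean(), L_2.mean())
```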
2.2. Building Deep Neural Networks from Denoising Auto-encoders

Denoising auto-encoders can be stacked into deep architectures [12] like standard auto-encoders or restricted Boltzmann machines. This process starts with training a single DAE to reconstruct corrupted versions of the input data. Afterwards, another DAE is trained on the hidden representation y of the first model, leaving the weights of the first model fixed, as described in Figure 3. This scheme can be continued for the desired number of layers. Each time, the auto-encoders computing the input representation for the model that is being trained perform their computation on uncorrupted input.

Figure 3: Training a stack of denoising auto-encoders. The hidden representation y of the first DAE is used as the input x of the second DAE, which is in turn corrupted, encoded and reconstructed.

When each layer has been pre-trained, the whole system can be transformed into a deep neural network. This involves replacing the decoding part of the top-most DAE with a neural network output layer and using the encoder parts of the remaining auto-encoders to initialize the hidden layers. The resulting network can then be treated like a multi-layer perceptron and fine-tuned with standard backpropagation.

3. BOTTLENECK FEATURES USING DEEP NEURAL NETWORKS

As mentioned in our previous works [13, 14], this section briefly describes the deep neural network architecture for bottleneck feature extraction proposed in [4] and depicted in Figure 4. The network consists of a variable number of moderately large, fully connected hidden layers and a small bottleneck layer, which is followed by an additional hidden layer and the final classification layer. The architecture differs from setups described previously, where the bottleneck layer was placed in the middle of a deep network [6, 15] or added as a second model trained on the output values of the original network [16].

Figure 4: Deep bottleneck features architecture. A window of speech input features is passed through the hidden layers and the bottleneck layer; the additional hidden layer and the output layer that follow the bottleneck are discarded after network training.

3.1. Neural Network Input

MFCCs were used as the input of the deep neural network. They contain 39 coefficients (12 cepstral coefficients and 1 energy coefficient, together with their delta and double-delta features) and were extracted after windowing with a window size of 25 milliseconds and a frame shift of 10 milliseconds. They were then pre-processed using the spliced speaker-adapted features approach as in [17], giving 40 dimensions per frame. The features were stacked over a context window of 9 frames, resulting in a total of 360 dimensions.

3.2. Layer-wise Pre-training

The hidden layers in front of the bottleneck are initialized using unsupervised, layer-wise pre-training. Thanks to their success in the deep learning community, restricted Boltzmann machines have become the default choice for pre-training the individual layers of deep neural networks used in speech recognition. Gehring et al. [4] demonstrated that denoising auto-encoders [11], straightforward models that have been successfully used for pre-training neural architectures for computer vision and sentiment classification [18], are applicable to speech data as well.

We follow their training scheme and initialize the hidden layers as denoising auto-encoders, too. Like regular auto-encoders, these models consist of one hidden layer and two identically-sized layers representing the input and output values. The network is usually trained to reconstruct its input at the output layer, with the goal of generating a useful intermediate representation in the hidden layer. In denoising auto-encoders, the network is trained to reconstruct a randomly corrupted version of its input, which can be interpreted as a regularizing mechanism that facilitates the learning of large and overcomplete hidden representations [11].

3.3. Adding the Bottleneck and Fine-tuning

After a stack of auto-encoders has been pre-trained in this fashion, a deep neural network can be constructed. The bottleneck layer, an additional hidden layer and the classification layer are initialized with random weights and connected to the hidden representation of the top-most auto-encoder. While all hidden units use the sigmoid activation function, the classification layer output is obtained with the softmax activation function. The resulting network is then trained with supervision to estimate either context-independent or context-dependent HMM tri-phone states. For this last training step, errors are obtained with the cross-entropy function. Finally, the last two layers of the network can be discarded, and the units in the bottleneck layer provide the final features used for training standard Gaussian mixture model (GMM) acoustic models. The sketch below illustrates how such a network can be assembled from the pre-trained layers.
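As a rough illustration of Sections 2.2 and 3.3, the following sketch assembles a feed-forward network from pre-trained DAE encoder weights, a randomly initialized bottleneck, an additional hidden layer and a softmax output. The 1024-unit hidden layers, 39-unit bottleneck and roughly 4,600 HMM state targets follow Section 4.2; the number of pre-trained layers (five), the size of the extra hidden layer, the random stand-ins for pre-trained weights and all function names are illustrative assumptions, not the authors' Theano code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

def random_layer(n_in, n_out):
    """Randomly initialized (weights, bias); stands in for pre-trained weights too."""
    return rng.normal(0.0, 0.01, (n_out, n_in)), np.zeros(n_out)

# Encoder weights (W, b) of the stacked, pre-trained DAEs (the paper evaluates 1 to 7 layers).
pretrained = [random_layer(360, 1024)] + [random_layer(1024, 1024) for _ in range(4)]

# Randomly initialized layers added on top: bottleneck, extra hidden layer, softmax output.
bottleneck = random_layer(1024, 39)
extra_hidden = random_layer(39, 1024)     # size of this layer is an assumption
output = random_layer(1024, 4600)         # context-dependent HMM state targets

def forward(x, return_bnf=False):
    h = x
    for W, b in pretrained:               # encoder parts of the stacked DAEs
        h = sigmoid(h @ W.T + b)
    W, b = bottleneck
    bnf = sigmoid(h @ W.T + b)            # 39-dimensional bottleneck activations
    if return_bnf:
        return bnf                        # features for the GMM system; later layers discarded
    W, b = extra_hidden
    h = sigmoid(bnf @ W.T + b)
    W, b = output
    return softmax(h @ W.T + b)           # posteriors over HMM states, for fine-tuning

x = rng.random((8, 360))                  # a few spliced input frames
state_posteriors = forward(x)             # used during supervised fine-tuning
features = forward(x, return_bnf=True)    # deep bottleneck features after training
```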
4. EXPERIMENTAL SETUP

The Voice of Vietnam (VOV) corpus was used in our experiments. It is a collection of story reading, mailbag, news reports and colloquy from the Voice of Vietnam radio program. There are 23,424 utterances in this corpus, covering about 30 male and female broadcasters and guests. The number of distinct syllables with tone is 4,923 and the number of distinct syllables without tone is 2,101. The total size of the corpus, stored in WAV format with a 16 kHz sampling rate and 16-bit analog/digital conversion precision, is about 2.5 GB. The total duration of the corpus is about 19 hours, which was separated into a training set of 17 hours and a test set of 2 hours. All transcriptions in the training data were used to train a tri-gram language model.

4.1. Baseline Systems

Baseline HMM/GMM systems were built with the Kaldi toolkit developed at Johns Hopkins University [19]. Two sets of acoustic features were extracted to build the baseline acoustic models: MFCC and PLP features, which are popular in speech recognition applications. In both feature extraction setups, the 16-kHz speech input is coded into 13-dimensional coefficients with a 25 ms window and a 10 ms frame shift. Each frame of the speech data is represented by a 39-dimensional feature vector consisting of the 13 coefficients with their deltas and double-deltas. Nine consecutive feature frames are spliced and reduced to 40 dimensions using linear discriminant analysis (LDA), and a maximum likelihood linear transformation (MLLT) [20], a feature orthogonalizing transform, is applied to make the features more accurately modeled by diagonal-covariance Gaussians.

All models used 4,600 context-dependent states and 96,000 Gaussian mixture components. The baseline systems were built following a typical maximum likelihood acoustic training recipe, beginning with a flat-start initialization of context-independent phonetic HMMs, followed by a tri-phone system with 13-dimensional MFCCs or PLPs plus their deltas and double-deltas, and ending with a tri-phone system using LDA+MLLT.

4.2. Network Training

The details of network training were given in our previous works [3, 14]. The network input was pre-processed using the spliced speaker-adapted features approach as in [17]. During supervised fine-tuning, the neural network was trained to predict context-dependent HMM states (there were about 4,600 states in our experiments).

For pre-training the stack of auto-encoders in the architecture of Section 3, mini-batch gradient descent with a batch size of 128 and a learning rate of 0.01 was used. Input vectors were corrupted by applying masking noise to set a random 20% of their elements to zero. Each auto-encoder contained 1024 hidden units; after 20 epochs its weights were fixed and the next one was trained on top of it. The remaining layers were then added to the network, with the bottleneck consisting of 39 units. Again, gradients were computed by averaging across a mini-batch of training examples; for fine-tuning, a larger batch size of 256 was used. The learning rate was governed by the "newbob" schedule: for the first epoch, 0.008 was used as the learning rate, and this rate was kept fixed as long as the increment in cross-validation frame accuracy in a single epoch was higher than 0.5%. For subsequent epochs, the learning rate was halved; this was repeated until the increase in cross-validation accuracy per epoch fell below a stopping threshold of 0.1%. After each epoch, the current model was evaluated on a separate validation set, and the model performing best on this set was used in the speech recognition system afterwards. For these experiments, GPUs were used to train the auto-encoder layers and neural networks with the Theano toolkit [21]. Training on 17 hours of VOV data took about 18 hours for this network architecture, which has around 9 million parameters.
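The "newbob" schedule described above can be summarized by a small helper like the one below. This is an illustrative sketch using the thresholds from this section, not the toolkit's implementation; train_one_epoch and eval_cv_accuracy are hypothetical callbacks assumed to wrap the actual network training and validation code.

```python
def newbob_training(train_one_epoch, eval_cv_accuracy,
                    initial_lr=0.008, ramp_threshold=0.005, stop_threshold=0.001):
    """Illustrative 'newbob' schedule: keep the learning rate fixed while the
    cross-validation frame accuracy improves by more than 0.5% per epoch,
    then halve it each epoch until the improvement falls below 0.1%.
    """
    lr = initial_lr
    halving = False
    prev_acc = eval_cv_accuracy()
    while True:
        train_one_epoch(lr)
        acc = eval_cv_accuracy()
        improvement = acc - prev_acc
        prev_acc = acc
        if halving and improvement < stop_threshold:
            break                        # stop training
        if improvement < ramp_threshold:
            halving = True               # start ramping down the learning rate
        if halving:
            lr /= 2.0
```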
5. EXPERIMENTAL RESULTS

In the first experiments, different network architectures and inputs were compared. For the input data, it was found that extracting deep bottleneck features from MFCC instead of PLP data resulted in consistently better recognition performance, by about 0.4% WER absolute, so MFCC features were used for the further experiments. The third column of Table 1 lists the WER of the BNF systems reported previously in [7], which were trained as a Multilayer Perceptron (MLP) with 3 layers (1000 units each) and without any pre-training technique. The WER of the BNF system is 15.5% with MFCC as the neural network input and 14.7% with PLP features, while the DBNF proposed here for Vietnamese speech recognition is as described in Section 3. As can be seen, the best DBNF result is about 2% absolute (about 14% relative) better than the BNF result.

Table 1: Recognition performance (WER, %) for the Vietnamese system with MFCC and PLP features. The DBNF columns give results for networks with 1 to 7 stacked DAE layers; the last row gives DBNF results for MFCC input without pre-training.

System            Baseline   BNF     DBNF, by number of DAE layers
                                     1      2      3      4      5      6      7
PLP               22.08      14.7    17.19  14.38  13.93  13.88  13.86  13.77  13.99
MFCC              21.25      15.5    16.11  13.99  13.76  13.68  13.40  13.39  13.48
No pre-training   -          -       15.52  14.84  14.33  14.35  14.41  14.51  14.69

5.1. Importance of Pre-training

Bottleneck features were extracted from networks of different depths, trained with and without unsupervised pre-training, in order to check the importance of pre-training the stack of denoising auto-encoders in front of the bottleneck layer.

Figure 5: Comparison of recognition performance (WER, %) against the number of auto-encoder layers for pre-trained and purely supervised trained neural networks.

Table 1 and Figure 5 show that when pre-training is applied, recognition performance improves as additional auto-encoders are added. When the neural network is trained purely supervised from random initialization, recognition performance degrades with deeper models.

5.2. Comparing and Combining with State-of-the-art Techniques

Since DBNFs were shown to achieve good Vietnamese recognition performance, further experiments were carried out with other state-of-the-art techniques, namely Speaker Adaptive Training (SAT) [22], discriminative training using the maximum mutual information (MMI) objective function [23] and subspace Gaussian mixture models (SGMM) [24], in order to compare their recognition performance with DBNFs and to determine the importance of DBNFs in Vietnamese recognition. According to the results listed in Table 2, these state-of-the-art techniques can still be applied on top of the DBNF system, and the best DBNF result over the full range of techniques is 12% relative better than the best result obtained when the same techniques are applied to the baseline (12.88% vs. 14.68%).

Table 2: Recognition performance (WER, %) when combining the DBNF and baseline systems with different state-of-the-art techniques.

System     MLLT    SAT     MMI     SAT+MMI   SGMM
Baseline   21.25   17.08   17.66   16.89     14.68
DBNF       13.39   12.90   13.19   12.88     13.04
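For reference, the relative improvements quoted in the abstract and in this section follow directly from the WER values in Tables 1 and 2; a quick illustrative check:

```python
def rel_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

print(rel_reduction(15.5, 13.39))    # DBNF vs. base BNF (MFCC input): about 14% relative
print(rel_reduction(14.68, 12.88))   # best DBNF pipeline vs. best baseline pipeline: about 12% relative
print(rel_reduction(21.25, 12.88))   # best DBNF pipeline vs. MFCC baseline: about 39% relative
```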
6. CONCLUSIONS

In this work, the bottleneck features of a Vietnamese speech recognition system were improved using deep neural networks; the resulting DBNFs increase the relative word error rate reduction over the MFCC baseline system from the 27% reported previously to 39%. It was shown that denoising auto-encoders are good models for initializing the bottleneck networks of a Vietnamese speech recognition system, and that other state-of-the-art techniques can still be applied on top of the DBNF system. The experimental setups used in this paper do not employ tonal features as input to the neural network and use only a weak language model. In future work, we therefore intend to investigate tonal feature extraction for use as part of the neural network input, as well as to build a stronger language model using deep neural networks as in [25].

REFERENCES

[1] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," in Acoustics, Speech and Signal Processing (ICASSP), 2007 IEEE International Conference on. IEEE, 2007, pp. IV-757 - IV-760.
[2] K. Kilgour, I. Tseyzer, Q. B. Nguyen, and A. Waibel, "Warped minimum variance distortionless response based bottle neck features for LVCSR," in ICASSP, 2013, pp. 6990-6994.
[3] Q. B. Nguyen, J. Gehring, K. Kilgour, and A. Waibel, "Optimizing deep bottleneck feature extraction," in Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2013 IEEE RIVF International Conference on, Nov. 2013, pp. 152-156.
[4] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in ICASSP 2013, Vancouver, Canada, 2013, pp. 3377-3381.
[5] Q. B. Nguyen, J. Gehring, M. Muller, S. Stuker, and A. Waibel, "Multilingual shifting deep bottleneck features for low-resource ASR," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, May 2014, pp. 5607-5611.
[6] D. Yu and M. L. Seltzer, "Improved bottleneck features using pretrained deep neural networks," in INTERSPEECH, 2011, pp. 237-240.
[7] V. H. Nguyen, C. M. Luong, and T. T. Vu, "Applying bottle neck feature for Vietnamese speech recognition," pp. 379-388, 2013.
[8] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, Jan. 2009.
[9] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive Science, pp. 147-169, 1985.
[10] G. E. Hinton, "Connectionist learning procedures," Artificial Intelligence, vol. 40, no. 1-3, pp. 185-234, 1989.
[11] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in ICML 2008, pp. 1096-1103.
[12] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371-3408, Dec. 2010.
[13] Q. B. Nguyen, T. T. Vu, and C. M. Luong, "Improving acoustic model for English ASR system using deep neural network," in Computing & Communication Technologies - Research, Innovation, and Vision for the Future (RIVF), 2015 IEEE RIVF International Conference on. IEEE, 2015, pp. 25-29.
[14] Q. B. Nguyen, T. T. Vu, and C. M. Luong, "Improving acoustic model for Vietnamese large vocabulary continuous speech recognition system using deep bottleneck features," in Knowledge and Systems Engineering. Springer, 2015, pp. 49-60.
[15] Z. Tüske, R. Schlüter, and H. Ney, "Deep hierarchical bottleneck MRASTA features for LVCSR," in ICASSP, 2013, pp. 6970-6974.
[16] T. Sainath, B. Kingsbury, and B. Ramabhadran, "Auto-encoder bottleneck features using deep belief networks," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, 2012, pp. 4153-4156.
[17] S. P. Rath, D. Povey, K. Vesely, and J. Cernocky, "Improved feature processing for deep neural networks," in INTERSPEECH. ISCA, 2013, pp. 109-113.
[18] X. Glorot, A. Bordes, and Y. Bengio, "Domain adaptation for large-scale sentiment classification: A deep learning approach," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 513-520.
[19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.
[20] M. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, no. 2, pp. 75-98, 1998.
[21] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: A CPU and GPU math compiler in Python," in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 3-10.
[22] T. Anastasakos, J. W. McDonough, R. M. Schwartz, and J. Makhoul, "A compact model for speaker-adaptive training," in ICSLP. ISCA, 1996.
[23] D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Cambridge University, 2004.
[24] D. Povey, L. Burget, M. Agarwal, P. Akyazi, F. Kai, A. Ghoshal, O. Glembek, N. Goel, M. Karafiat, A. Rastrow et al., "The subspace Gaussian mixture model: a structured model for speech recognition," Computer Speech & Language, vol. 25, no. 2, pp. 404-439, 2011.
[25] N. Q. Pham, H. S. Le, T. T. Vu, and C. M. Luong, "The speech recognition and machine translation system of IOIT for IWSLT 2013," in Proceedings of the International Workshop for Spoken Language Translation (IWSLT), 2013.

Received on March 08, 2015
Revised on October 19, 2015
