HMM MODELING FOR AUDIO-VISUAL SPEECH RECOGNITION

Qi Zhi, Mustafa Nazmi Kaynak, Kuntal Sengupta, Adrian David Cheok, C. C. Ko
Department of Electrical and Computer Engineering
National University of Singapore
10 Kent Ridge Crescent, Singapore 119260
Email: [email protected]
ABSTRACT

Bimodal speech recognition is a robust technique for automated speech analysis and has received a lot of attention in the last few decades. In this paper, we analyze the effect of the HMM models on the performance of the bimodal speech recognizer, present a comparative analysis of the different HMM models that can be used in bimodal speech recognition, and finally propose a novel model, which has been experimentally verified to perform better than the others. One of the unique characteristics of our HMM model is the novel fusion strategy for the acoustic and the visual features, which takes into account the different sampling rates of these two signals. Compared to audio-only recognition, the bimodal speech recognition scheme has a much improved recognition accuracy, especially in the presence of noise.

2001 IEEE International Conference on Multimedia and Expo. 0-7695-1198-8/01 $17.00 © 2001 IEEE.

1. INTRODUCTION

Bimodal speech recognition is a novel extension of audio-based speech recognition. The main motivation behind bimodal speech recognition is the bimodal character of the speech perception and production systems in human beings. This bimodal character is demonstrated by the "McGurk effect" [1]: when a subject is presented with conflicting acoustic and visual signals, the perceived signal may be completely different from both the acoustic and the visual signal. The primary advantage of the visual signal is that it is not affected by acoustic noise or cross talk among speakers. Another advantage is the complementary structure [2] of phonemes and visemes, the smallest acoustically and visually distinguishable units of a language, respectively.

Previous studies of the visual speech signal are mostly based on hidden Markov models, neural networks, and fuzzy logic. Petajan developed one of the first audio-visual speech recognition systems [3]. Stork et al. used time-delay neural networks (TDNNs) for audio-visual speech recognition [4]. Tomlinson et al. suggested an HMM structure, the cross-product model topology, which incorporates a state transition topology allowing asynchronous behaviour of the acoustic and visual features [5]. In [6], Dupont et al. studied audio-visual speech modeling for continuous speech recognition, using an appearance-based model of the visual articulators.

Despite the plethora of work in the audio-visual speech recognition area, a few issues still need to be answered:

- What is the optimal model of a bimodal HMM? To our knowledge there has not been a single systematic study on the effect of the different modelling parameters (such as the number of states and the number of Gaussian mixtures) on the recognition accuracy.
- Which geometric visual features lead to a better recognition rate?
- The different sampling rates of the acoustic and the visual signal, and consequently of the features extracted from them, lead to the problem of integrating these features. What is the optimal way to fuse such multimodal signals?

In this paper, we present a systematic study of the above problems, and the results of this study aid us in the design of our own bimodal speech recognition engine, which is distinct from most commercial and academic systems. We present a comparative performance of this system against a standard HTK HMM.

2. HMM MODELING

Speech signals are slowly time-varying signals: when examined over a sufficiently short period of time (between 5 and 100 msec), a speech signal can be regarded as stationary [7]. HMM models have been widely used in speech recognition studies over the last few years. In this section, we briefly discuss the basics of HMM theory, followed by a description of the acoustic and the visual features chosen for the HMM.

2.1. Basic HMM theory

The hidden Markov model is a stochastic model widely used for characterizing the spectral properties of the frames of patterns [7]. An HMM is characterized by: (1) the number of states in the model (N), (2) the number of Gaussian mixtures per state (M), (3) the state transition probability distribution (A), (4) the observation symbol probability distribution (B), and (5) the initial state distribution (π). The compact notation λ = (A, B, π) is used to indicate the complete parameter set of an HMM.

An HMM poses three fundamental problems which must be solved for the model to be used in real-world applications: (i) the evaluation (recognition) problem: computing the probability P(O | λ) of the observation sequence O given the model λ; (ii) the segmentation problem: determining the best (optimal) state sequence Q given the observation sequence O and the model λ; and (iii) the training problem: adjusting the model parameters λ = (A, B, π) so as to best account for the observations, i.e., to maximize P(O | λ).

For our research, we use a left-right HMM in which state transitions of more than two states are not allowed. We mainly use two different HMM codes: the first is the HTK HMM code from Microsoft, and the second is our own HMM code written in Matlab. We used the HTK code as a benchmark to evaluate whether the recognition accuracy of our code is reasonable. In developing our HMM code, the implementations for the first two HMM problems were straightforward: we use the forward-backward procedure and the Viterbi algorithm to solve the evaluation and segmentation problems, respectively. For the most difficult problem, training, we use the Baum-Welch algorithm, taking into account practical implementation issues such as scaling, multiple observation sequences, and initial parameter estimates, which are explained in [7].

For acoustic modeling, we use both the HTK HMM and the developed Matlab HMM. Table 2 shows the comparison of their recognition accuracies. From Table 2, we conclude that the developed HMM code also performs well, so this code is used for the rest of the experiments, in order that the recognition results of acoustic-only, visual-only, and audio-visual speech recognition can be rigorously compared.

Table 2: Recognition accuracy (%) of the HTK and Matlab HMM codes under various acoustic SNRs

Source       Clean  30dB  25dB  20dB  15dB  10dB  5dB   0dB
HTK HMM      100    100   97.9  94.2  84.2  75.0  60.4  47.1
Matlab HMM   96.3   96.3  96.7  95.8  95.0  92.9  83.3  62.1

Table 1 shows the average recognition results of acoustic speech recognition for the different HMMs (the experimental setup is described in Section 2.2). From the table, it can be seen that as the number of Gaussian mixture components increases, the performance of the HMMs improves. Among the six different digit models, the HMM with thirty-two Gaussian mixtures and four states appears to be the best. A very important conclusion from Table 1 is that, even though the vocabulary used in this research is limited to ten digits, the recognition rate drops to very low levels as the noise level increases. This reveals that in noisy environments the visual signal can be used together with the acoustic signal to increase the recognition performance of the system.

Table 1: Acoustic HMM modeling, recognition rate (%) (N: number of states; M: number of Gaussian mixture components)

SNR     N=4,M=8  N=4,M=16  N=4,M=32  N=5,M=8  N=5,M=16  N=5,M=32
Clean    96.7     96.3      96.3      90.4     95.4      96.7
30dB     96.7     96.3      96.3      92.5     95.4      97.5
25dB     95.8     95.8      96.7      92.9     96.3      97.1
20dB     94.6     96.7      95.8      92.5     95.8      96.7
15dB     91.7     95.0      95.0      92.9     93.3      93.7
10dB     86.3     91.3      92.9      89.2     93.3      90.0
5dB      79.6     81.3      83.3      77.5     80.8      79.2
0dB      55.4     60.8      62.1      52.5     57.1      57.9
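For concreteness, the scaled forward recursion used for the evaluation problem of Section 2.1 can be sketched as follows. This is a minimal Python illustration, not the authors' Matlab or HTK implementation, and it takes the per-frame emission likelihoods B[t][j] as given rather than computing them from Gaussian mixtures:

```python
import math

def forward_log_likelihood(pi, A, B):
    """Scaled forward recursion: returns log P(O | lambda).

    pi : length-N initial state distribution
    A  : N x N state transition matrix, A[i][j] = a_ij
    B  : T x N matrix, B[t][j] = b_j(o_t), emission likelihood of frame t in state j

    Normalising alpha at every frame (scaling) avoids the numerical
    underflow that the unscaled recursion suffers on long sequences [7].
    """
    N = len(pi)
    log_lik = 0.0
    # initialisation: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [pi[j] * B[0][j] for j in range(N)]
    for t in range(len(B)):
        if t > 0:
            # induction: alpha_t(j) = (sum_i alpha_{t-1}(i) a_ij) * b_j(o_t)
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[t][j]
                     for j in range(N)]
        c = sum(alpha)           # scaling coefficient c_t
        alpha = [a / c for a in alpha]
        log_lik += math.log(c)   # log P(O | lambda) = sum_t log c_t
    return log_lik

# Toy 2-state left-right model, T = 3 frames.
pi = [1.0, 0.0]
A = [[0.6, 0.4],
     [0.0, 1.0]]        # left-right: no transition back to state 1
B = [[0.9, 0.2],
     [0.5, 0.5],
     [0.1, 0.8]]
print(forward_log_likelihood(pi, A, B))   # ≈ -1.400
```

The backward pass, Viterbi decoding, and Baum-Welch re-estimation used in the paper follow the same scaled-recursion pattern.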
In the following sections we describe the HMM modeling of the acoustic and visual signals and analyze basic geometric visual features for bimodal speech recognition.

2.2. Acoustic HMM modeling

Applications that involve speech processing require specific representations of the speech information. A wide range of possibilities exists for parametrically representing the speech signal, including the short-time energy, zero-crossing rates, level-crossing rates, and other related parameters. Among these, the most important parametric representation of speech is the short-time spectral envelope [7]. Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficient (MFCC) spectral analysis models have been used widely in speech recognition applications. Usually, the first- and second-order derivatives of the MFCC coefficients are also used, to take into account the dynamic evolution of the speech signal, which carries information relevant to speech recognition.

For our acoustic speech recognition experiments, we use thirteen-dimensional MFCCs as the standard audio features. In order to find the best HMM, we also experimentally examine many different models. For each digit model, the recognition results are plotted as a function of the signal-to-noise ratio (SNR). In the experiments, we add white noise to obtain noisy speech signals and calculate the SNR as the logarithm of the ratio between the average power of the speech signal and that of the white noise. For each digit model, the number of states was fixed to four or five, and eight, sixteen, or thirty-two Gaussian mixtures were used, in order to find the optimum model.

There are ten subjects in our audio-visual database. For training each digit model we used seventy-seven sets of training data from nine speakers, and for evaluating the recognition performance we used twenty-four sets of checking data from ten speakers.

2.3. Visual HMM modeling

In audio-visual and visual speech recognition studies, some of the visual features used are geometric/model-based parameters such as the area, height, and width of the mouth region; gray-scale parameters of the mouth region; lip contours obtained by curve fitting; spline parameters of the inner/outer contour; and motion parameters obtained by 3D tracking. In this section, we explain the HMM modeling of basic geometric visual features for bimodal speech recognition. In this research, we experimentally carry out a rigorous, systematic analysis of important geometric visual features and determine the effects of the HMM modeling parameters on the recognition results for these features. After the single-feature analysis, we experiment on combinations of geometric visual features to find the best-performing visual feature combination for bimodal speech recognition. For visual HMM modeling, we use left-right HMM models with continuous observation densities and diagonal covariance matrices.

In order to have accurate and noise-free training, six blue points were stuck on each speaker's face in the audio-visual database, as shown in Fig. 1. After automatically detecting the centers of the dots using a correlation-based approach, we computed four basic geometric lip parameters, shown in Fig. 2: the outer-lip horizontal width (X), the outer-lip vertical width (Y), the outer-lip area (area), and the angle of the outer-lip corner (angle). The vertical distance between the points on the chin and nose was used to normalize X and Y, so that our features are invariant to the distance between the speaker and the camera.
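Assuming the detected dot centres are available as (x, y) coordinates, the feature extraction and normalization just described can be sketched as follows. The area and angle formulas here are illustrative assumptions (the paper specifies the four parameters and the nose-chin normalization, but not its exact geometry), and the helper for the first-order derivative features d(·)/dt reduces to simple finite differences:

```python
import math

def lip_features(left, right, top, bottom, nose, chin):
    """Four basic geometric lip parameters from tracked dot centres (x, y).

    The elliptical area approximation and the corner-angle formula are
    illustrative guesses, not the paper's exact geometry.
    """
    scale = abs(chin[1] - nose[1])       # nose-chin distance: camera-distance invariance
    X = abs(right[0] - left[0]) / scale  # outer-lip horizontal width, normalized
    Y = abs(bottom[1] - top[1]) / scale  # outer-lip vertical width, normalized
    area = math.pi * X * Y / 4.0         # outer-lip area (elliptical approximation)
    angle = math.atan2(Y, X)             # opening angle at the lip corner (radians)
    return X, Y, area, angle

def first_derivatives(seq, dt):
    """First-order time derivative of a feature sequence by forward differences,
    as used for the d(X)/dt, d(Y)/dt, d(area)/dt, d(angle)/dt features."""
    return [(b - a) / dt for a, b in zip(seq, seq[1:])]
```

For example, with dots at left=(0, 5), right=(4, 5), top=(2, 3), bottom=(2, 7), nose=(2, 0), chin=(2, 10), the normalized widths are X = Y = 0.4 regardless of how far the speaker sits from the camera.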
Figure 1: A frame of a face image with six blue points.
Figure 2: The basic geometric visual features from the lips.

Observing the video sequences of the digits, we find that not only the shape of the outer-lip contour but also the movements of the lip contours are important for distinguishing the digits. If the four basic geometric lip parameters are regarded as functions of time, their first-order derivatives indicate the changes in these parameters over time. For visual HMM modeling we therefore experiment with ten single visual features: X, Y, area, angle, an artificially constructed parameter Y/X, and their first-order derivatives.

We carry out the analysis of the geometric visual features in two stages. In the first stage, we determine the best single visual features for each digit classifier. For this purpose, for each digit and each single visual feature, six HMMs having four or five states with four, eight, or sixteen Gaussian mixtures were trained, and we evaluated the recognition performance of the trained models against the checking data. The average recognition rates over all the digits and all the HMM models are shown in Table 3.

Table 3: Recognition rates (%) for the single-visual-feature experiment (A.R.: average recognition rate)

Digit    X     Y     Y/X   area  angle  d(X)/dt  d(Y)/dt  d(Y/X)/dt  d(area)/dt  d(angle)/dt
One     35.4  37.5  31.3  23.6  29.2   22.9     29.9     36.1       35.4        23.6
Two     30.6  31.3  34.7  14.6  35.4   27.8     21.2     25.0       17.4        27.1
Three   37.5  19.4  27.8  15.9  29.2   34.0     16.7     19.4       25.7        22.2
Four    38.9  51.4  20.8  34.0  21.6   38.2     50.7     34.0       44.4        35.4
Five    52.8  72.2  33.3  36.8  39.6   59.0     68.8     66.7       61.8        65.9
Six     48.6  22.9  28.5  16.7  29.9   13.9      6.9      8.3       17.4         6.3
Seven   29.9  58.3  49.3  22.2  45.8   20.1     68.1     56.9       50.7        60.4
Eight   34.7  34.0  16.7  10.4  28.5   27.1     45.8     26.4       22.2        22.2
Nine    38.2  36.1  24.3   7.6  20.8    9.7     31.3     27.8       22.9        25.7
Zero     6.9  15.3  12.5  33.3  20.8   16.7     25.7     31.3       26.4        34.7
A.R.    35.4  37.9  27.9  21.5  30.0   26.9     37.3     33.2       32.4        32.4

From these results, we conclude that the geometric visual features along the vertical direction are more important than those along the horizontal direction. The single visual feature d(angle)/dt, which captures coordinated lip movements along both the vertical and horizontal directions, is an important feature that cannot be ignored. Taking the average recognition rates in Table 3 into account, X, Y, angle, d(Y)/dt, d(angle)/dt, and d(Y/X)/dt are comparatively good single geometric visual features for bimodal speech recognition.

At the second stage, the question is which combination of these single geometric visual features should be used for bimodal speech recognition. In order to find the best combined visual features, we conducted experiments on seven combinations of these six well-performing single visual features for all ten digits, trying seven different HMMs for each visual feature digit model. The results are shown in Table 4.

Table 4: Evaluation of HMM models for combined visual features, average recognition rate (%) (N: number of states; M: number of Gaussian mixture components)

Visual feature           N=4,M=4  N=4,M=8  N=4,M=16  N=4,M=32  N=5,M=8  N=5,M=16  N=5,M=32
X-Y                       55.4     57.9     62.5      63.3      63.3     62.9      62.9
Y-d(Y)/dt                 47.5     46.7     47.1      42.9      46.7     51.3      52.1
X-Y-d(Y)/dt               67.1     65.0     70.8      70.8      70.4     72.5      70.8
X-Y-angle                 50.1     51.7     50.4      51.7      52.5     50.8      50.8
X-Y-d(angle)/dt           63.3     68.3     72.1      72.5      67.9     73.3      74.6
X-Y-d(Y)/dt-d(Y/X)/dt     55.4     60.8     60.0      59.2      60.0     62.5      60.8
X-Y-angle-d(angle)/dt     60.4     60.0     60.8      62.1      59.2     60.0      60.8

According to this table, the best visual feature combination is [X, Y, d(angle)/dt], and the best model for this combination is the HMM with five states and thirty-two mixtures. A more important conclusion from these results is that although the single visual features do not perform well when used individually, combining them into a joint visual feature improves the recognition performance dramatically and makes it more robust.

After the experiments on single and combined visual features, we decided to use the combination [X, Y, d(angle)/dt] as the visual feature vector for the bimodal speech recognition experiments. In the next section we discuss audio-visual HMM modeling and issues in audio and video fusion.

3. FUSION FOR AUDIO-VISUAL HMM MODELING

The acoustic signal is susceptible to acoustic noise; therefore, in noisy environments the acoustic signal alone is not enough to identify speech, as shown in Table 1. The speech production and perception systems of humans are bimodal in nature, and the visual speech recognition experiments of Section 2.3 verify that the visual signal carries information relevant to speech recognition. Taking advantage of both the acoustic and the visual information of speech is therefore a natural and robust approach to recognition. Here, we explain how the audio and visual features can be fused and modeled for bimodal speech recognition.

In the previous section, we decided to use the combination [X, Y, d(angle)/dt] as our visual features; for the acoustic features we use the thirteen-dimensional MFCCs. A very important problem in audio-visual speech recognition is how to fuse the acoustic and visual speech information. There are two main approaches to solve this problem.
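As a toy sketch of the two fusion strategies, assuming frame-synchronous feature vectors for the first and per-word log-likelihood scores for the second (the combination weight w is a common convention in the literature, not a parameter of this paper, whose system follows the feature-fusion path):

```python
def early_integration(audio_frame, visual_frame):
    """Early integration (direct identification): concatenate synchronised
    audio and visual feature vectors into one joint vector, which is then
    modeled by a single HMM."""
    return list(audio_frame) + list(visual_frame)

def late_integration(audio_loglik, visual_loglik, w=0.5):
    """Late integration: each modality has its own recognition engine;
    the per-word log-likelihood scores are fused afterwards. The linear
    weight w is an illustrative convention, not something used here."""
    return w * audio_loglik + (1.0 - w) * visual_loglik
```

For example, concatenating a thirteen-dimensional MFCC frame with a four-dimensional visual frame yields the seventeen-dimensional joint vector used later in this section.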
The first is early integration (feature fusion with a single recognition engine); the second is late integration (two different recognition engines followed by identification fusion).

In this research, we develop an early integration system through the direct identification (DI) strategy shown in Fig. 3. Because of the different sampling rates of the acoustic and visual features, it is not easy to combine the two kinds of features into joint vectors; this is the problem of non-synchronism between the audio and visual channels when merging the acoustic and visual signals. We solved this problem without losing any information from either channel.

Figure 3: DI-based early integration system.

For this purpose, we first upsample each single visual feature signal using low-pass interpolation. Then we obtain the new visual features from 25-msec windows with 10-msec overlap (the same windowing as for the acoustic features), so that the visual feature sequence has the same number of samples as the corresponding acoustic feature sequence. After this interpolation, we concatenate the four-dimensional visual and thirteen-dimensional acoustic feature vectors to form a seventeen-dimensional joint feature vector.

After obtaining the audio-visual feature vectors, the next step is to train the HMM models for each digit. Six different HMMs, with four or five states and eight, sixteen, or thirty-two Gaussian mixtures, were trained. We evaluate the performance of these HMM models using the checking data, which includes clean and noise-corrupted data. The average recognition accuracies (over all the digits) are shown in Table 5.

Table 5: Audio-visual speech recognition using different digit models under various acoustic SNRs, average recognition rate (%) (N: number of states; M: number of Gaussian mixture components)

SNR     N=4,M=8  N=4,M=16  N=4,M=32  N=5,M=8  N=5,M=16  N=5,M=32
Clean    96.3     97.9      98.8      96.7     97.5      98.3
30dB     96.3     97.9      98.8      97.1     97.5      97.9
25dB     96.7     97.9      98.8      97.1     97.5      97.9
20dB     96.3     97.5      98.8      96.7     96.7      97.9
15dB     96.3     96.3      98.3      94.6     96.7      97.9
10dB     95.8     95.4      97.9      93.8     95.8      96.3
5dB      88.3     94.2      94.2      90.0     92.1      92.9
0dB      80.4     83.3      82.1      82.5     81.7      81.7

We carried out three kinds of digit recognition experiments: acoustic-feature-only (Section 2.2), visual-feature-only (Section 2.3), and bimodal-feature (Section 3) recognition. We fix the HMM model to four states and thirty-two Gaussian mixtures, which is the common best model for the three recognition engines, and plot the average recognition accuracy (over all the digits) of these three experiments in Fig. 4. Comparing these average recognition accuracies, we find, first, that visual-only recognition has an accuracy of about 75% and is insensitive to acoustic noise. Second, acoustic-only recognition performs well on clean data, but its accuracy decreases greatly as the noise increases. Third, bimodal recognition is more robust to noise than acoustic-only recognition and more accurate than visual-only recognition: the bimodal speech recognition system outperformed both the acoustic-only and the visual-only systems.

Figure 4: Recognition rate using different speech information sources under various acoustic SNRs.

4. CONCLUSION

In this paper, we describe an audio-visual speech recognition system for multi-speaker, isolated digit recognition. We test our system at different acoustic noise levels and compare its performance with visual-only and acoustic-only speech recognition. Geometric visual features are analyzed for bimodal speech recognition, and we present an early integration algorithm for fusing the acoustic and visual speech information. The audio-visual speech recognition system improves the recognition rate, especially at high noise levels, when compared with acoustic-based speech recognition.

5. ACKNOWLEDGMENT

The authors would like to thank Dr Chng Eng Siong of Knowles Electronics, Singapore, for his constant encouragement and support.

6. REFERENCES

[1] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, no. 5588, pp. 746-748, 1976.
[2] B. Yuhas, M. H. Goldstein, Jr., T. Sejnowski, and R. Jenkins, "Neural network models of sensory integration for improved vowel recognition," Proceedings of the IEEE, vol. 78, no. 10, pp. 1658-1668, 1990.
[3] E. Petajan, "Automatic lipreading to enhance speech recognition," in Proc. IEEE Global Telecommunications Conference (GLOBECOM '84), vol. 1, pp. 265-272, 1984.
[4] D. Stork, G. Wolff, and E. Levine, "Neural network lipreading system for improved speech recognition," in Proc. International Joint Conference on Neural Networks (IJCNN), vol. 2, pp. 289-295, 1992.
[5] M. Tomlinson, M. Russell, and N. Brooke, "Integrating audio and visual information to provide highly robust speech recognition," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 821-824, 1996.
[6] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Transactions on Multimedia, vol. 2, no. 3, pp. 141-151, 2000.
[7] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall Signal Processing Series, 1993.