Detection of Emergency Words with Automatic Image Based Lip Reading Method

Keywords

lip reading, convolutional neural network, SSD

Abstract

Lip reading automation can play a crucial role in ensuring or enhancing security at noisy, large-scale events such as concerts, rallies, and public meetings by detecting emergency keywords. In this study, the aim is to automatically detect emergency words from a person's lip movements using images extracted from silent video frames. To this end, an original dataset was used, consisting of silent videos in which 8 emergency words were spoken by 14 different speakers. The lip regions in the images obtained from these videos were labeled as the regions of interest for detection, and the labeled data were evaluated with the SSD (Single Shot MultiBox Detector) deep learning method. Subsets of the labeled data containing 8, 6, and 4 classes were then created, and the SSD algorithm was evaluated separately on each subset. During training, the 'he', 'glorot', and 'narrow-normal' weight initialization methods were used and their performances were compared. In addition, the SSD algorithm was trained with two different values of the maximum-epochs parameter, 20 and 30. According to the results, the lowest accuracy, 42%, was obtained for the 8-class subset with 20 epochs of training and the 'narrow-normal' weight initialization method, while the highest accuracy, 76%, was obtained for the 4-class subset with 30 epochs of training and the 'glorot' weight initialization method.
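
A minimal sketch of the training setup described above, assuming a PyTorch/torchvision SSD implementation rather than the authors' original code: it re-initializes the detector head with each of the three compared weight-initialization schemes ('he', 'glorot', 'narrow-normal') and trains for a fixed number of epochs (20 or 30). The class counts (4, 6, or 8 emergency words) follow the abstract; the specific model (ssd300_vgg16), optimizer, learning rate, and data loader are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models.detection import ssd300_vgg16

def init_weights(module: nn.Module, scheme: str) -> None:
    # Apply one of the three compared weight-initialization schemes
    # to every convolutional layer in the given module.
    for m in module.modules():
        if isinstance(m, nn.Conv2d):
            if scheme == "he":
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            elif scheme == "glorot":
                nn.init.xavier_uniform_(m.weight)
            elif scheme == "narrow-normal":
                nn.init.normal_(m.weight, mean=0.0, std=0.01)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

def build_detector(num_words: int, scheme: str) -> nn.Module:
    # One class per emergency word plus background.
    model = ssd300_vgg16(weights=None, num_classes=num_words + 1)
    init_weights(model.head, scheme)  # 'he' / 'glorot' / 'narrow-normal'
    return model

def train(model, loader, max_epochs=20, lr=1e-3, device="cpu"):
    # loader yields (images, targets), where targets hold the labeled
    # lip-region bounding boxes and emergency-word class labels.
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(max_epochs):  # 20 or 30 epochs, as in the study
        for images, targets in loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            losses = model(images, targets)  # classification + box regression losses
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Example: the best-reported configuration (4 classes, 'glorot', 30 epochs).
# detector = build_detector(num_words=4, scheme="glorot")
# train(detector, lip_region_loader, max_epochs=30)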


    Published: 2024-03-27

    Issue: Vol. 3 No. 1 (2024)

    Section: Research Articles

    How to cite:
    [1]
    B. Ülkümen and A. Öztürk, “Detection of Emergency Words with Automatic Image Based Lip Reading Method”, Intell Methods Eng Sci, vol. 3, no. 1, pp. 1–6, Mar. 2024.


    IMIENS open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license allows readers to share and adapt the material provided they give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.