Open Access

Study on Speech Recognition Optimization of Cross-modal Attention Mechanism in Low Resource Scenarios


DOI: 10.23977/jeis.2024.090318

Author(s)

Zhuoxi Li 1

Affiliation(s)

1 School of Computing, Guangdong Neusoft University, Foshan, China

Corresponding Author

Zhuoxi Li

ABSTRACT

With the rapid development of artificial intelligence, speech recognition has become one of the most important interfaces for human-computer interaction, and its accuracy and robustness are crucial to the user experience. In low-resource scenarios, however, such as noise interference, dialect accents, and limited labeled data, the performance of speech recognition systems often deteriorates significantly. To address this problem, this paper proposes a speech recognition optimization method based on the cross-modal attention mechanism. The cross-modal attention mechanism offers a new approach to speech recognition in low-resource scenarios: by integrating information from different modalities (such as vision and text) and exploiting the complementarity between them, it enhances the recognition ability of the model. In speech recognition tasks, audio signals are often strongly correlated with the visual information that accompanies them, such as lip movements and gestures. Through cross-modal attention, the model can focus on the visual features most closely related to the speech content and thereby recognize audio signals more accurately. This paper first introduces speech recognition technology and its challenges in low-resource scenarios, then analyzes the basic principle of the cross-modal attention mechanism and discusses in detail its application strategies for low-resource speech recognition. By introducing an attention mechanism, a neural network can automatically learn to focus selectively on the important information in its input, thereby improving the performance and generalization ability of the model.
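As a concrete illustration of the mechanism the abstract describes, the following is a minimal sketch (not the paper's actual model) of cross-modal attention in PyTorch, in which audio frame features act as queries over visual lip-movement features. All names and dimensions (CrossModalAttention, audio_dim, visual_dim, and so on) are illustrative assumptions, not taken from the paper.

```python
# Minimal cross-modal attention sketch for audio-visual speech recognition.
# Assumption: audio frame features (queries) attend over visual lip/gesture
# features (keys/values); dimensions are illustrative, not from the paper.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, audio_dim=256, visual_dim=128, embed_dim=256, num_heads=4):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        # Attention weights select the visual features most relevant
        # to each audio frame.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, audio, visual):
        # audio:  (batch, T_audio, audio_dim)
        # visual: (batch, T_visual, visual_dim)
        q = self.audio_proj(audio)
        kv = self.visual_proj(visual)
        fused, _ = self.attn(query=q, key=kv, value=kv)
        # Residual connection keeps the original audio evidence intact.
        return self.norm(q + fused)

# Example: 100 audio frames attend over 25 video frames.
audio = torch.randn(2, 100, 256)
visual = torch.randn(2, 25, 128)
fused = CrossModalAttention()(audio, visual)
print(fused.shape)  # torch.Size([2, 100, 256])
```

The residual connection plus layer normalization is a common design choice here: when the visual stream is uninformative, the fused representation falls back on the original audio evidence, which matters in the noisy, low-resource conditions the paper targets.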

KEYWORDS

Cross-modal attention mechanism; Low resource scenario; Speech recognition; Multimodal information fusion

CITE THIS PAPER

Zhuoxi Li, Study on Speech Recognition Optimization of Cross-modal Attention Mechanism in Low Resource Scenarios. Journal of Electronics and Information Science (2024) Vol. 9: 135-141. DOI: http://dx.doi.org/10.23977/jeis.2024.090318.




All published work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2016 - 2031 Clausius Scientific Press Inc. All Rights Reserved.