Study on Speech Recognition Optimization of Cross-modal Attention Mechanism in Low Resource Scenarios
DOI: 10.23977/jeis.2024.090318
Author(s)
Zhuoxi Li 1
Affiliation(s)
1 School of Computing, Guangdong Neusoft University, Foshan, China
Corresponding Author
Zhuoxi Li
ABSTRACT
With the rapid development of artificial intelligence, speech recognition has become one of the main interfaces for human-computer interaction, and its accuracy and robustness are crucial to the user experience. In low-resource scenarios, however, such as those involving noise interference, dialect accents, and limited labeled data, the performance of speech recognition systems often deteriorates significantly. To address this problem, this paper proposes a speech recognition optimization method based on a cross-modal attention mechanism, which offers a new approach to speech recognition in low-resource scenarios. By integrating information from different modalities (such as vision and text), the mechanism exploits their complementarity to enhance the recognition ability of the model. In speech recognition tasks, audio signals are often strongly correlated with the visual information that accompanies them (such as lip movements and gestures). Through cross-modal attention, the model can focus on the visual features most closely related to the speech content and thereby recognize the audio signal more accurately. This paper first introduces speech recognition technology and its challenges in low-resource scenarios, then analyzes the basic principle of the cross-modal attention mechanism and discusses in detail strategies for applying it to low-resource speech recognition. With an attention mechanism, the neural network can learn automatically to focus selectively on the important information in its input, which improves the performance and generalization ability of the model.
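The cross-modal attention idea summarized above can be illustrated with a minimal sketch. It is not the method of the paper, whose architecture is not given on this page. It assumes plain scaled dot-product attention in which each audio frame acts as a query over the visual frames, with no learned projections and with both modalities already mapped to the same feature dimension. The function names `softmax` and `cross_modal_attention`, the additive fusion step, and the toy feature values are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(audio_frames, visual_frames):
    """Each audio frame (query) attends over all visual frames (keys/values).

    Assumption: no learned projections; both modalities already share
    the same feature dimension d. Returns (fused_frames, attention_weights).
    """
    d = len(audio_frames[0])
    fused, weights = [], []
    for q in audio_frames:
        # Scaled dot-product similarity between the audio query and each visual key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in visual_frames]
        w = softmax(scores)
        # Weighted sum of the visual values: the visual context for this audio frame.
        context = [sum(wj * v[i] for wj, v in zip(w, visual_frames))
                   for i in range(d)]
        # Assumed residual-style fusion: audio feature plus attended visual context.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return fused, weights

# Toy features (made-up numbers, not data from the paper).
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = cross_modal_attention(audio, visual)
```

In this toy input, the first audio frame places its largest weight on the first visual frame, because that frame has the highest dot product with it. A practical system would add learned query, key, and value projections and multiple attention heads.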
KEYWORDS
Cross-modal attention mechanism; Low resource scenario; Speech recognition; Multimodal information fusion
CITE THIS PAPER
Zhuoxi Li, Study on Speech Recognition Optimization of Cross-modal Attention Mechanism in Low Resource Scenarios. Journal of Electronics and Information Science (2024) Vol. 9: 135-141. DOI: http://dx.doi.org/10.23977/jeis.2024.090318.