Semi-supervised End-to-end Speech Recognition
DOI: 10.23977/acss.2023.070305
Author(s)
Jiewen Ning 1, Yugang Dai 1, Guanyu Li 1, Sirui Li 1, Senyan Li 1
Affiliation(s)
1 Northwest Minzu University, Lanzhou, Gansu, 730000, China
Corresponding Author
Yugang Dai
ABSTRACT
The currently popular end-to-end speech recognition technology requires a large amount of labeled data, which limits the development of speech recognition for low-resource languages. In this paper, we propose speech recognition schemes based on semi-supervised learning. We aim to influence the speech-to-text mapping of unlabeled speech by learning a text-to-text mapping from unlabeled text through a shared encoder. The proposed scheme uses a Conformer network as the shared encoder, which can extract both text features and speech features. With a CTC/Attention network as the decoder, the model is iteratively self-trained on a small amount of labeled data and a large amount of unlabeled data. On the Aishell-1 dataset, the WER of the rescoring results was 12.54, a 37% reduction in WER compared with the baseline system. Compared with the state-of-the-art supervised system, we use less labeled data and the error rate differs by only 56%, which shows that our method has great potential in semi-supervised speech recognition.
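The WER figures in the abstract can be related with a little arithmetic. The sketch below assumes that the "37% reduction" is a relative reduction against the baseline WER; the abstract does not state the baseline value, so the number computed here is only what that assumption implies, not a reported result. The function names are illustrative.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, expressed as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, reduction: float) -> float:
    """Baseline WER implied by a reported WER and a relative reduction.

    Assumption: `reduction` is relative (0.37 means 37% of the baseline).
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return new_wer / (1.0 - reduction)


if __name__ == "__main__":
    # 12.54 and 0.37 are taken from the abstract; the baseline is derived.
    baseline = implied_baseline(12.54, 0.37)
    print(f"implied baseline WER: {baseline:.2f}")
    print(f"relative reduction check: {relative_reduction(baseline, 12.54):.2f}")
```

Under the relative-reduction reading, the implied baseline is roughly 20; if the 37% were instead an absolute difference, the arithmetic would not apply.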
KEYWORDS
End to end, Automatic Speech Recognition, semi-supervised, shared encoder, conformer
CITE THIS PAPER
Jiewen Ning, Yugang Dai, Guanyu Li, Sirui Li, Senyan Li. Semi-supervised End-to-end Speech Recognition. Advances in Computer, Signals and Systems (2023) Vol. 7: 33-38. DOI: http://dx.doi.org/10.23977/acss.2023.070305.