
Semi-supervised End-to-end Speech Recognition


DOI: 10.23977/acss.2023.070305

Author(s)

Jiewen Ning 1, Yugang Dai 1, Guanyu Li 1, Sirui Li 1, Senyan Li 1

Affiliation(s)

1 Northwest Minzu University, Lanzhou, Gansu, 730000, China

Corresponding Author

Yugang Dai

ABSTRACT

Popular end-to-end speech recognition systems require large amounts of labeled data, which limits the development of speech recognition technology for low-resource languages. In this paper, we propose a speech recognition scheme based on semi-supervised learning. Using a shared encoder, we learn a text-to-text mapping from unlabeled text in order to improve the speech-to-text mapping. The proposed scheme uses a Conformer network as the shared encoder, which can extract both text features and speech features, and a CTC/Attention network as the decoder; the model is iteratively self-trained on a small amount of labeled data and a large amount of unlabeled data. The WER of the rescoring results on the Aishell-1 dataset was 12.54%, a 37% relative reduction compared with the baseline system. Compared with the state-of-the-art supervised system, our method uses less labeled data at the cost of only a 56% relative increase in error rate, which shows its great potential for semi-supervised speech recognition.
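The training scheme described in the abstract — train on the small labeled set, pseudo-label the unlabeled pool, fold in the confident pseudo-labels, and retrain — can be sketched generically. The toy nearest-centroid classifier below is purely illustrative: it stands in for the paper's Conformer encoder and CTC/Attention decoder, and all names (`self_train`, `fit_centroids`, the margin-based confidence) are our own assumptions, not the authors' implementation.

```python
# Illustrative self-training loop (stand-in for the paper's iterative
# pseudo-labeling; the real model is a Conformer encoder with a
# CTC/Attention decoder, not this toy nearest-centroid classifier).

def fit_centroids(data):
    """Compute one centroid (mean) per label from (value, label) pairs."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, confidence) for the nearest centroid."""
    dists = {y: abs(x - c) for y, c in centroids.items()}
    best = min(dists, key=dists.get)
    # crude confidence: margin between best and second-best distance
    others = [d for y, d in dists.items() if y != best]
    margin = (min(others) - dists[best]) if others else 1.0
    return best, margin

def self_train(labeled, unlabeled, rounds=3, threshold=0.5):
    """Iteratively pseudo-label confident unlabeled points and retrain."""
    pool = list(unlabeled)
    train = list(labeled)
    for _ in range(rounds):
        model = fit_centroids(train)
        confident, rest = [], []
        for x in pool:
            y, margin = predict(model, x)
            (confident if margin >= threshold else rest).append((x, y))
        if not confident:          # nothing crossed the threshold; stop
            break
        train.extend(confident)    # absorb confident pseudo-labels
        pool = [x for x, _ in rest]
    return fit_centroids(train)

labeled = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
unlabeled = [0.4, 0.9, 9.4, 9.8, 5.2]
model = self_train(labeled, unlabeled)
print(predict(model, 0.2)[0])  # a point near the "a" cluster
```

The confidence threshold plays the same gatekeeping role that hypothesis scoring plays in speech pseudo-labeling: ambiguous utterances (here, the point at 5.2, midway between the clusters) are held back rather than allowed to pollute the training set.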

KEYWORDS

End-to-end, Automatic Speech Recognition, Semi-supervised, Shared Encoder, Conformer

CITE THIS PAPER

Jiewen Ning, Yugang Dai, Guanyu Li, Sirui Li, Senyan Li. Semi-supervised End-to-end Speech Recognition. Advances in Computer, Signals and Systems (2023) Vol. 7: 33-38. DOI: http://dx.doi.org/10.23977/acss.2023.070305.

REFERENCES

[1] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in neural information processing systems, 2017, 30.
[2] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
[3] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376.
[4] Chorowski J, Bahdanau D, Cho K, et al. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv preprint arXiv:1412.1602, 2014.
[5] Chorowski J, Bahdanau D, Serdyuk D, et al. Attention-Based Models for Speech Recognition. Computer Science, 2015, 10(4): 429-439.
[6] Graves A. Sequence Transduction with Recurrent Neural Networks. Computer Science, 2012, 58(3):235-242.
[7] Chapelle O, Scholkopf B, Zien A. Semi-supervised learning (chapelle, o. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 2009, 20(3): 542-542.
[8] Ouali Y, Hudelot C, Tami M. An overview of deep semi-supervised learning. arXiv preprint arXiv:2006.05278, 2020.
[9] Higuchi Y, Moritz N, Roux J L, et al. Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition. 2021.
[10] Zhang Y, Park D S, Han W, et al. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition. 2021.
Lample G, Denoyer L, Ranzato M. Unsupervised Machine Translation Using Monolingual Corpora Only. 2017.
[11] Karita S, Watanabe S, Iwata T, et al. Semi-Supervised End-to-End Speech Recognition//Interspeech. 2018: 2-6.
[12] Chan W, Jaitly N, Le Q, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition// 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016.
[13] Dai Z, Yang Z, Yang Y, et al. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
[14] Bu H, Du J, Na X, et al. Aishell-1: An open-source Mandarin speech corpus and a speech recognition baseline//2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017: 1-5.
[15] Yao Z, Wu D, Wang X, et al. WeNet: Production oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit. 2021.
[16] Zhang B, Wu D, Peng Z, et al. WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit. 2022.
[17] Chickering D M. Optimal structure identification with greedy search. Journal of machine learning research, 2002, 3(Nov): 507-554.
[18] Hannun A Y, Maas A L, Jurafsky D, et al. First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873, 2014.
[19] Hori T, Hori C, Minami Y, et al. Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition. IEEE Transactions on audio, speech, and language processing, 2007, 15(4): 1352-1365.
[20] Aleksic P, Ghodsi M, Michaely A, et al. Bringing Contextual Information to Google Speech Recognition// International Conference on Concurrency Theory. Springer-Verlag, 2015.


All published work is licensed under a Creative Commons Attribution 4.0 International License.
