Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing…
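As a rough illustration of the idea in this abstract, the sketch below approximates dot-product attention scores over a biasing catalogue by scoring a small codebook instead of every entry, then re-scores only a shortlist exactly. It is a generic vector-quantization sketch under stated assumptions, not the authors' method: the k-means quantizer, the names `train_codebook`, `approx_scores`, and `shortlist`, and all sizes are hypothetical choices made for the example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_code(vec, codebook):
    # Index of the codebook vector closest to vec (squared Euclidean distance).
    return min(range(len(codebook)), key=lambda c: sq_dist(vec, codebook[c]))

def train_codebook(keys, num_codes, iters=10, seed=0):
    # Plain k-means, used here only as a stand-in quantizer (an assumption).
    rng = random.Random(seed)
    codebook = [list(k) for k in rng.sample(keys, num_codes)]
    for _ in range(iters):
        assign = [nearest_code(k, codebook) for k in keys]
        for c in range(num_codes):
            members = [k for k, a in zip(keys, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                codebook[c] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def approx_scores(query, codebook, assign):
    # num_codes dot products, then one table lookup per catalogue entry.
    code_scores = [dot(query, c) for c in codebook]
    return [code_scores[a] for a in assign]

def shortlist(query, keys, codebook, assign, top_k):
    # Pick top_k entries by approximate score, then re-rank them exactly.
    approx = approx_scores(query, codebook, assign)
    candidates = sorted(range(len(keys)), key=lambda i: approx[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)

# Toy catalogue: 200 random 8-dimensional "biasing entry" embeddings.
rng = random.Random(1)
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebook = train_codebook(keys, num_codes=16)
assign = [nearest_code(k, codebook) for k in keys]

query = keys[42]  # hypothetical audio-side query vector
top = shortlist(query, keys, codebook, assign, top_k=20)
```

In this toy setup the per-query cost of the approximate pass is 16 dot products plus 200 lookups rather than 200 dot products, and only the codebook and one code index per entry are needed for that pass. Whether the true best entry survives the shortlist depends on the codebook size and on `top_k`.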
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and…
Recent advances in deep learning and automatic speech recognition have boosted the accuracy of end-to-end speech recognition to a new level. However, recognition of personal content such as contact names remains a challenge. In this work, we present a personalization solution for an end-to-end system based on connectionist temporal classification.…
Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP. Independently, these methods reduce model size and can accelerate inference, but their relative benefit and combinatorial interactions have not been rigorously studied. For each of the eight possible subsets of these…