Introducing Whisper

We’ve trained and are open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy on English speech recognition.

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
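The chunking step described above can be sketched in a few lines. This is a minimal illustration, not Whisper's actual preprocessing code: the 16 kHz sample rate is the value reported in the Whisper paper rather than in this post, zero-padding the final chunk is an assumption, and `split_into_chunks` is a hypothetical helper name. The log-Mel spectrogram conversion that follows chunking is omitted.

```python
from typing import List

SAMPLE_RATE = 16_000      # assumption: samples per second, as reported in the Whisper paper
CHUNK_SECONDS = 30        # from the text: audio is split into 30-second chunks
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(samples: List[float]) -> List[List[float]]:
    """Split mono audio into fixed 30-second windows.

    The final window is zero-padded so every chunk has the same length
    (padding strategy is an assumption made for this sketch).
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


# 70 seconds of constant-valued dummy audio: two full chunks plus
# one chunk holding 10 seconds of signal followed by padding.
audio = [1.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[-1]))
```

Fixed-length chunks let the encoder always see input of the same shape, which is why the short final segment is padded rather than passed through at its natural length.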

Other existing approaches frequently use smaller, more closely paired audio-text training datasets, or use broad but unsupervised audio pretraining. Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
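To make the "fewer errors" comparison concrete, here is a minimal sketch of word error rate (WER), the standard error measure in speech recognition. Treating WER as the metric behind the figure above is an assumption, since this post does not name it, and `word_error_rate` is a hypothetical helper rather than part of the Whisper code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 3))
```

Under this measure, a model with a WER of 0.10 makes 50% fewer errors than one with a WER of 0.20 on the same data.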

About a third of Whisper’s audio dataset is non-English, and the model is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation, and it outperforms the supervised SOTA on CoVoST2 to-English translation in the zero-shot setting.
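The switch between transcription and translation is made through special tokens placed at the start of the decoder's output sequence. The sketch below shows how such a prefix could be assembled. The token spellings follow the Whisper paper and released tokenizer rather than this post, so treat them as assumptions here, and `build_task_prefix` is a hypothetical helper for illustration only.

```python
from typing import List

VALID_TASKS = ("transcribe", "translate")


def build_task_prefix(language: str, task: str, timestamps: bool = False) -> List[str]:
    """Return the special-token prefix that tells the decoder what to do.

    language:   short language code of the spoken audio, e.g. "fr"
    task:       "transcribe" (same language) or "translate" (to English)
    timestamps: when False, a token asking for no timestamps is appended
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


# The same French audio can be routed to either task by changing one token.
print(build_task_prefix("fr", "transcribe"))
print(build_task_prefix("fr", "translate"))
```

Because the task is expressed as a token rather than a separate model head, a single set of weights can serve both tasks.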

We hope Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications. Check out the paper, model card, and code to learn more details and to try out Whisper.



