Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that makes it straightforward for you to add speech-to-text capabilities to your applications. Today, we are happy to announce a next-generation multi-billion parameter speech foundation model-powered system that expands automatic speech recognition to over 100 languages. In this post, we discuss some of the benefits of this system, how companies are using it, and how to get started. We also provide an example of the transcription output below.
Transcribe’s speech foundation model is trained using best-in-class, self-supervised algorithms to learn the inherent universal patterns of human speech across languages and accents. It is trained on millions of hours of unlabeled audio data from over 100 languages. The training recipes are optimized through smart data sampling to balance the training data between languages, ensuring that traditionally under-represented languages also reach high accuracy levels.
Carbyne is a software company that develops cloud-based, mission-critical contact center solutions for emergency call responders. Carbyne’s mission is to help emergency responders save lives, and language can’t get in the way of their goals. Here is how they use Amazon Transcribe to pursue their mission:
“AI-powered Carbyne Live Audio Translation is directly aimed at helping improve emergency response for the 68 million Americans who speak a language other than English at home, in addition to the up to 79 million foreign visitors to the country annually. By leveraging Amazon Transcribe’s new multilingual foundation model powered ASR, Carbyne will be even better equipped to democratize life-saving emergency services, because Every. Person. Counts.”
– Alex Dizengof, Co-Founder and CTO of Carbyne.
By leveraging a speech foundation model, Amazon Transcribe delivers a significant accuracy improvement of between 20% and 50% across most languages. On telephony speech, which is a challenging and data-scarce domain, the accuracy improvement is between 30% and 70%. In addition to the substantial accuracy improvement, this large ASR model also improves readability through more accurate punctuation and capitalization. With the advent of generative AI, thousands of enterprises are using Amazon Transcribe to unlock rich insights from their audio content. With significantly improved accuracy and support for over 100 languages, Amazon Transcribe will positively impact all such use cases. All existing and new customers using Amazon Transcribe in batch mode can access speech foundation model-powered speech recognition without any change to either the API endpoint or input parameters.
The new ASR system delivers several key features across all the 100+ languages related to ease of use, customization, user safety, and privacy. These include features such as automatic punctuation, custom vocabulary, automatic language identification, speaker diarization, word-level confidence scores, and custom vocabulary filter. The system’s expanded support for different accents, noise environments, and acoustic conditions enables you to produce more accurate outputs and thereby helps you effectively embed voice technologies in your applications.
Enabled by the high accuracy of Amazon Transcribe across different accents and noise conditions, its support for a large number of languages, and its breadth of value-added feature sets, thousands of enterprises will be empowered to unlock rich insights from their audio content, as well as increase the accessibility and discoverability of their audio and video content across various domains. For instance, contact centers transcribe and analyze customer calls to identify insights and subsequently improve customer experience and agent productivity. Content producers and media distributors automatically generate subtitles using Amazon Transcribe to improve content accessibility.
Get started with Amazon Transcribe
You can use the AWS Command Line Interface (AWS CLI), AWS Management Console, and various AWS SDKs for batch transcriptions and continue to use the same StartTranscriptionJob API to get performance benefits from the enhanced ASR model without needing to make any code or parameter changes on your end. For more information about using the AWS CLI and the console, refer to Transcribing with the AWS CLI and Transcribing with the AWS Management Console, respectively.
The first step is to upload your media files into an Amazon Simple Storage Service (Amazon S3) bucket, an object storage service built to store and retrieve any amount of data from anywhere. Amazon S3 offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost. You can choose to save your transcript in your own S3 bucket, or have Amazon Transcribe use a secure default bucket. To learn more about using S3 buckets, see Creating, configuring, and working with Amazon S3 buckets.
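As a rough sketch of these two steps with the AWS SDK for Python (boto3), the following uploads a media file to Amazon S3 and then calls StartTranscriptionJob. The bucket name, object key, job name, and the helper names build_job_request and upload_and_transcribe are placeholders chosen for illustration, not part of the service. boto3 is imported inside the function that talks to AWS, so the request-building helper can be run without credentials.

```python
from typing import Optional


def build_job_request(job_name: str, bucket: str, key: str,
                      output_bucket: Optional[str] = None) -> dict:
    """Build keyword arguments for the StartTranscriptionJob API."""
    request = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        # Let the service detect the spoken language; pass LanguageCode
        # instead if the language is known in advance.
        "IdentifyLanguage": True,
    }
    if output_bucket is not None:
        # Without OutputBucketName, Amazon Transcribe stores the transcript
        # in a service-managed bucket.
        request["OutputBucketName"] = output_bucket
    return request


def upload_and_transcribe(local_path: str, job_name: str, bucket: str, key: str,
                          output_bucket: Optional[str] = None) -> dict:
    """Upload a media file to Amazon S3, then start a batch transcription job."""
    import boto3  # imported lazily; requires AWS credentials and a Region

    boto3.client("s3").upload_file(local_path, bucket, key)
    transcribe = boto3.client("transcribe")
    return transcribe.start_transcription_job(
        **build_job_request(job_name, bucket, key, output_bucket)
    )


if __name__ == "__main__":
    # Placeholder names for illustration only; nothing is sent to AWS here.
    print(build_job_request("example-job", "example-media-bucket", "calls/call-001.wav"))
```

Leaving out the output bucket corresponds to the default-bucket option described above; passing one stores the transcript in your own bucket.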
Amazon Transcribe uses JSON representation for its output. It provides the transcription result in two different formats: text format and itemized format. Nothing changes with respect to the API endpoint or input parameters.
The text format provides the transcript as a block of text, whereas the itemized format provides the transcript as chronologically ordered transcribed items, along with additional metadata per item. Both formats exist in parallel in the output file.
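To illustrate how the two formats relate, here is a minimal sketch that reads both from a trimmed, made-up sample shaped like a transcript file. The sample values and the helper names text_from_items and low_confidence_words are hypothetical, and the 0.8 threshold is an arbitrary choice for the example.

```python
import json

# Illustrative, trimmed sample in the shape of a transcript file; values are made up.
SAMPLE = """
{
  "jobName": "example-job",
  "status": "COMPLETED",
  "results": {
    "transcripts": [{"transcript": "Hello world."}],
    "items": [
      {"type": "pronunciation", "start_time": "0.04", "end_time": "0.41",
       "alternatives": [{"confidence": "0.998", "content": "Hello"}]},
      {"type": "pronunciation", "start_time": "0.41", "end_time": "0.89",
       "alternatives": [{"confidence": "0.61", "content": "world"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
"""


def text_from_items(items):
    """Rebuild plain text from the itemized format."""
    out = ""
    for item in items:
        content = item["alternatives"][0]["content"]
        # Punctuation attaches to the previous word; words are space-separated.
        if item["type"] == "punctuation" or not out:
            out += content
        else:
            out += " " + content
    return out


def low_confidence_words(items, threshold=0.8):
    """Return spoken words whose confidence score falls below the threshold."""
    return [
        item["alternatives"][0]["content"]
        for item in items
        if item["type"] == "pronunciation"
        and float(item["alternatives"][0]["confidence"]) < threshold
    ]


results = json.loads(SAMPLE)["results"]
print(results["transcripts"][0]["transcript"])  # text format
print(text_from_items(results["items"]))        # rebuilt from itemized format
print(low_confidence_words(results["items"]))   # words worth a manual review
```

The itemized format is the one to use when you need timestamps or per-word confidence, for example to flag uncertain words for human review.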
Depending on the features you select when creating the transcription job, Amazon Transcribe creates additional and enriched views of the transcription result.
The views are as follows:
- Transcripts – Represented by the transcripts element, it contains only the text format of the transcript. In multi-speaker, multi-channel scenarios, concatenation of all transcripts is provided as a single block.
- Speakers – Represented by the speaker_labels element, it contains the text and itemized formats of the transcript grouped by speaker. It’s available only when the multi-speakers feature is enabled.
- Channels – Represented by the channel_labels element, it contains the text and itemized formats of the transcript, grouped by channel. It’s available only when the multi-channels feature is enabled.
- Items – Represented by the items element, it contains only the itemized format of the transcript. In multi-speaker, multi-channel scenarios, items are enriched with additional properties, indicating speaker and channel.
- Segments – Represented by the segments element, it contains the text and itemized formats of the transcript, grouped by alternative transcription. It’s available only when the alternative results feature is enabled.
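The views listed above can be explored with a short sketch like the following, which checks which view elements are present in a results object and groups spoken words by speaker. The sample data is made up, the helper names are hypothetical, and the code assumes that each enriched item carries its speaker in a speaker_label property; verify that property name against your own output files.

```python
# Element names of the views, as listed above.
VIEW_ELEMENTS = ("transcripts", "speaker_labels", "channel_labels", "items", "segments")


def available_views(results):
    """Return the view elements present in a results object."""
    return [name for name in VIEW_ELEMENTS if name in results]


def words_by_speaker(items):
    """Group spoken words by speaker (assumes a speaker_label property per item)."""
    grouped = {}
    for item in items:
        if item.get("type") != "pronunciation":
            continue  # skip punctuation items
        speaker = item.get("speaker_label", "unknown")
        grouped.setdefault(speaker, []).append(item["alternatives"][0]["content"])
    return grouped


# Made-up results for a two-speaker job; speaker_labels contents are omitted.
results = {
    "transcripts": [{"transcript": "Hi there. Hello."}],
    "speaker_labels": {},
    "items": [
        {"type": "pronunciation", "speaker_label": "spk_0",
         "alternatives": [{"confidence": "0.99", "content": "Hi"}]},
        {"type": "pronunciation", "speaker_label": "spk_0",
         "alternatives": [{"confidence": "0.97", "content": "there"}]},
        {"type": "punctuation",
         "alternatives": [{"confidence": "0.0", "content": "."}]},
        {"type": "pronunciation", "speaker_label": "spk_1",
         "alternatives": [{"confidence": "0.98", "content": "Hello"}]},
        {"type": "punctuation",
         "alternatives": [{"confidence": "0.0", "content": "."}]},
    ],
}

print(available_views(results))
print(words_by_speaker(results["items"]))
```

Checking for a view before reading it keeps the same code usable for jobs that did not enable the multi-speaker, multi-channel, or alternative results features.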
At AWS, we are constantly innovating on behalf of our customers. By extending the language support in Amazon Transcribe to over 100 languages, we enable our customers to serve users from diverse linguistic backgrounds. This not only enhances accessibility, but also opens up new avenues for communication and information exchange on a global scale. To learn more about the features discussed in this post, check out the features page and the what’s new post.
About the authors
Sumit Kumar is a Principal Product Manager, Technical on the AWS AI Language Services team. He has 10 years of product management experience across a variety of domains and is passionate about AI/ML. Outside of work, Sumit loves to travel and enjoys playing cricket and lawn tennis.
Vivek Singh is a Senior Manager, Product Management on the AWS AI Language Services team. He leads the Amazon Transcribe product team. Prior to joining AWS, he held product management roles across various other Amazon organizations such as consumer payments and retail. Vivek lives in Seattle, WA and enjoys running and hiking.