Amazon Polly is a service that turns text into lifelike speech. It enables the development of a whole class of applications that can convert text into speech in multiple languages.
This service can be used by chatbots, audio books, and other text-to-speech applications in conjunction with other AWS AI or machine learning (ML) services. For example, Amazon Lex and Amazon Polly can be combined to create a chatbot that engages in a two-way conversation with a user and performs certain tasks based on the user’s commands. Amazon Transcribe, Amazon Translate, and Amazon Polly can be combined to transcribe speech to text in the source language, translate it to a different language, and speak it.
In this post, we present an interesting approach for highlighting text as it’s being spoken using Amazon Polly. This solution can be used in many text-to-speech applications to do the following:
- Add visual capabilities to audio in books, websites, and blogs
- Increase comprehension when customers are trying to understand the text rapidly as it’s being spoken
Our solution gives the client (the browser, in this example), the ability to know what text (word or sentence) is being spoken by Amazon Polly at any instant. This enables the client to dynamically highlight the text as it’s being spoken. Such a capability is useful for providing visual aid to speech for the use cases mentioned previously.
Our solution can be extended to perform additional tasks besides highlighting text. For example, the browser can show images, play music, or perform other animations on the front end as the text is being spoken. This capability is useful for creating dynamic audio books, educational content, and richer text-to-speech applications.
At its core, the solution uses Amazon Polly to convert a string of text into speech. The text can be input from the browser or through an API call to the endpoint exposed by our solution. The speech generated by Amazon Polly is stored as an audio file (MP3 format) in an Amazon Simple Storage Service (Amazon S3) bucket.
However, using the audio file alone, the browser can’t find what parts of the text are being spoken at any instant because we don’t have granular information on when each word is spoken.
Amazon Polly provides a way to obtain this using speech marks. Speech marks are stored in a text file that shows the time (measured in milliseconds from start of the audio) when each word or sentence is spoken.
Amazon Polly returns speech mark objects in a line-delimited JSON stream. A speech mark object contains the following fields:
- Time – The timestamp in milliseconds from the beginning of the corresponding audio stream
- Type – The type of speech mark (sentence, word, viseme, or SSML)
- Start – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
- End – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)
- Value – This varies depending on the type of speech mark:
- SSML – <mark> SSML tag
- Viseme – The viseme name
- Word or sentence – A substring of the input text as delimited by the start and end fields
For example, the sentence “Mary had a little lamb” can give you the following speech marks file if you use
SpeechMarkTypes = [“word”, “sentence”] in the API call to obtain the speech marks:
The word “had” (at the end of line 3) begins 373 milliseconds after the audio stream begins, starts at byte 5, and ends at byte 8 of the input text.
The architecture of our solution is presented in the following diagram.
The Lambda function creates pre-signed URLs for the speech and speech marks files and returns them to the browser in the form of an array (7, 8, 9).
When the browser sends the text file to the API endpoint (3), it gets back two pre-signed URLs for the audio file and the speech marks file in one synchronous invocation (9). This is indicated by the key symbol next to the arrow.
To run this solution, you need an AWS account with an AWS Identity and Access Management (IAM) user who has permission to use Amazon CloudFront, Amazon API Gateway, Amazon Polly, Amazon S3, AWS Lambda, and AWS Step Functions.
Use Lambda to generate speech and speech marks
The following code invokes the Amazon Polly
synthesize_speech function two times to fetch the audio and speech marks file. They’re run as asynchronous functions and coordinated to return the result at the same time using promises.
Instead of the previous approach, you can consider a few alternatives:
- Create both the speech marks and audio files inside a Step Functions state machine. The state machine can invoke the parallel branch condition to invoke two different Lambda functions: one to generate speech and another to generate speech marks. The code for this can be found in the using-step-functions subfolder in the Github repo.
- Invoke Amazon Polly asynchronously to generate the audio and speech marks. This approach can be used if the text content is large or the user doesn’t need a real-time response. For more details about creating long audio files, refer to Creating Long Audio Files.
- Have Amazon Polly create the presigned URL directly using the
generate_presigned_urlcall on the Amazon Polly client in Boto3. If you go with this approach, Amazon Polly generates the audio and speech marks newly every time. In our current approach, we store these files in Amazon S3. Although these stored files aren’t accessible from the browser in our version of the code, you can modify the code to play previously generated audio files by fetching them from Amazon S3 (instead of regenerating the audio for the text again using Amazon Polly). We have more code examples for accessing Amazon Polly with Python in the AWS Code Library.
Create the solution
The entire solution is available from our Github repo. To create this solution in your account, follow the instructions in the README.md file. The solution includes an AWS CloudFormation template to provision your resources.
To clean up the resources created in this demo, perform the following steps:
- Delete the S3 buckets created to store the CloudFormation template (Bucket A), the source code (Bucket B) and the website (
- Delete the CloudFormation stack
- Delete the S3 bucket containing the speech files (
pth-speech-[Suffix]). This bucket was created by the CloudFormation template to store the audio and speech marks files generated by Amazon Polly.
In this post, we showed an example of a solution that can highlight text as it’s being spoken using Amazon Polly. It was developed using the Amazon Polly speech marks feature, which provides us markers for the place each word or sentence begins in an audio file.
The solution is available as a CloudFormation template. It can be deployed as is to any web application that performs text-to-speech conversion. This would be useful for adding visual capabilities to audio in books, avatars with lip-sync capabilities (using viseme speech marks), websites, and blogs, and for aiding people with hearing impairments.
It can be extended to perform additional tasks besides highlighting text. For example, the browser can show images, play music, and perform other animations on the front end while the text is being spoken. This capability can be useful for creating dynamic audio books, educational content, and richer text-to-speech applications.
We welcome you to try out this solution and learn more about the relevant AWS services from the following links. You can extend the functionality for your specific needs.
About the Author
Varad G Varadarajan is a Trusted Advisor and Field CTO for Digital Native Businesses (DNB) customers at AWS. He helps them architect and build innovative solutions at scale using AWS products and services. Varad’s areas of interest are IT strategy consulting, architecture, and product management. Outside of work, Varad enjoys creative writing, watching movies with family and friends, and traveling.