Categories: FAANG

Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…
AI Generated Robotic Content

Recent Posts

Using Amazon Q Business with AWS HealthScribe to gain insights from patient consultations

With the advent of generative AI and machine learning, new opportunities for enhancement became available…

2 hours ago

How a 12-Ounce Layer of Foam Changed the NFL

Even the makers of the Guardian Cap admit it looks silly. But for a sport…

3 hours ago

Combining next-token prediction and video diffusion in computer vision and robotics

In the current AI zeitgeist, sequence models have skyrocketed in popularity for their ability to…

3 hours ago

What Is Perplexity AI? Understanding One Of Google’s Biggest Search Engine Competitors

What is Perplexity AI? Is it an over-hyped replacement for Google as a search engine,…

1 day ago

Scalable Private Search with Wally

This paper presents Wally, a private search system that supports efficient semantic and keyword search…

1 day ago

How DPG Media uses Amazon Bedrock and Amazon Transcribe to enhance video metadata with AI-powered pipelines

This post was co-written with Lucas Desard, Tom Lauwers, and Sam Landuydt from DPG Media.…

1 day ago