Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…
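Below is a minimal sketch of what multi-speaker multi-style (MSMS) conditioning can look like in practice: an acoustic model that takes both a speaker ID and a style ID (e.g. regular TTS recordings vs. long-form recordings) as utterance-level conditioning, so any speaker can be paired with any style at inference time. The layer sizes, class and parameter names, and the simple embedding-concatenation scheme are illustrative assumptions, not the architecture from the paper.

```python
# Hypothetical MSMS acoustic model sketch (PyTorch); all dimensions and the
# conditioning scheme are assumptions for illustration only.
import torch
import torch.nn as nn


class MSMSAcousticModel(nn.Module):
    def __init__(self, n_phonemes=80, n_speakers=8, n_styles=4,
                 emb_dim=256, cond_dim=64, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        # Separate lookup tables let the model share acoustic knowledge across
        # speakers while treating style (e.g. "read TTS" vs. "long-form") as an
        # independent factor that can be transferred across speakers.
        self.speaker_emb = nn.Embedding(n_speakers, cond_dim)
        self.style_emb = nn.Embedding(n_styles, cond_dim)
        self.encoder = nn.LSTM(emb_dim + 2 * cond_dim, 512,
                               batch_first=True, bidirectional=True)
        self.mel_proj = nn.Linear(1024, n_mels)

    def forward(self, phonemes, speaker_id, style_id):
        # phonemes: (batch, time) int64; speaker_id, style_id: (batch,) int64
        x = self.phoneme_emb(phonemes)
        cond = torch.cat([self.speaker_emb(speaker_id),
                          self.style_emb(style_id)], dim=-1)
        # Broadcast the utterance-level conditioning to every input frame.
        cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.encoder(torch.cat([x, cond], dim=-1))
        return self.mel_proj(h)  # predicted mel-spectrogram frames


# Style transfer at inference: pair any speaker ID with any style ID, even if
# that speaker never recorded material in that style.
model = MSMSAcousticModel()
phonemes = torch.randint(0, 80, (1, 50))
mels = model(phonemes, speaker_id=torch.tensor([3]), style_id=torch.tensor([1]))
```

In this kind of setup, the shared encoder is trained on the pooled regular-TTS and long-form data, which is one plausible way a single MSMS model can outperform separate single-speaker, single-style models.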