
Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…
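The core idea of the abstract — that a multi-speaker multi-style model lets a speaking style learned from one speaker's data be applied to another speaker's voice — can be sketched in a toy form. The sketch below is a hypothetical illustration, not the paper's architecture: it assumes speaker identity and speaking style are represented as separate embedding tables whose vectors are concatenated to condition an acoustic model, so any speaker embedding can be paired with any style embedding at synthesis time. All class and variable names here are invented for illustration.

```python
import random


class MSMSConditioner:
    """Toy sketch of multi-speaker multi-style (MSMS) conditioning.

    Hypothetical: speaker and style are modeled as independent embedding
    tables. Because the two factors are disentangled, "speaker A in
    style B" can be requested even if speaker A never recorded any
    material in style B -- the mechanism behind cross-speaker style
    transfer described in the abstract.
    """

    def __init__(self, speakers, styles, dim=4, seed=0):
        rng = random.Random(seed)
        # In a real model these vectors would be learned jointly with
        # the acoustic model; here they are just random placeholders.
        self.speaker_emb = {s: [rng.gauss(0, 1) for _ in range(dim)]
                            for s in speakers}
        self.style_emb = {s: [rng.gauss(0, 1) for _ in range(dim)]
                          for s in styles}

    def condition(self, speaker, style):
        # The acoustic model would consume this concatenated vector.
        return self.speaker_emb[speaker] + self.style_emb[style]


cond = MSMSConditioner(["spk_a", "spk_b"], ["neutral", "long_form"])

# Style transfer: pair spk_a's voice with the long-form style, even if
# only spk_b contributed long-form training recordings.
vec = cond.condition("spk_a", "long_form")
```

Because the speaker half of the conditioning vector is identical regardless of the requested style, voice identity is (in this idealized sketch) preserved while the style half varies.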
AI Generated Robotic Content
