Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…
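One common way to realize a multi-speaker multi-style (MSMS) model of the kind described above is to condition the acoustic model on learned speaker and style embeddings, broadcast over time and concatenated to the text-encoder output. The sketch below is a minimal, hedged illustration of that general conditioning scheme, not the paper's exact architecture; all names, dimensions, and the random initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SPEAKERS, NUM_STYLES = 4, 3   # e.g. styles: neutral, expressive, long-form
EMB_DIM, ENC_DIM, T = 8, 16, 50   # toy sizes, not taken from the paper

# Learned lookup tables (randomly initialized stand-ins here; in training
# these would be optimized jointly with the rest of the TTS model)
speaker_table = rng.normal(size=(NUM_SPEAKERS, EMB_DIM))
style_table = rng.normal(size=(NUM_STYLES, EMB_DIM))

def condition_encoder(encoder_out, speaker_id, style_id):
    """Broadcast speaker and style embeddings over time and concatenate
    them to every encoder frame (one common MSMS conditioning scheme)."""
    frames = encoder_out.shape[0]
    spk = np.tile(speaker_table[speaker_id], (frames, 1))
    sty = np.tile(style_table[style_id], (frames, 1))
    return np.concatenate([encoder_out, spk, sty], axis=-1)

enc = rng.normal(size=(T, ENC_DIM))   # stand-in for text-encoder output
cond = condition_encoder(enc, speaker_id=2, style_id=1)
print(cond.shape)  # (50, 32): ENC_DIM + 2 * EMB_DIM
```

Because the style embedding is decoupled from the speaker embedding, a style observed only in one speaker's long-form recordings can, in principle, be paired with any speaker ID at synthesis time, which is the style-transfer behavior the abstract describes.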