Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…