
Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…
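The core idea of multi-speaker multi-style (MSMS) modeling is to condition a shared acoustic model on both a speaker identity and a speaking style, so that styles learned from one speaker's data (e.g. long-form recordings) can transfer to another. The paper does not publish its architecture here, so the following is only a minimal sketch of the conditioning step, with hypothetical dimensions and random stand-in embedding tables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- not taken from the paper.
NUM_SPEAKERS, NUM_STYLES = 4, 2      # e.g. styles: "regular TTS", "long-form"
TEXT_DIM, SPK_DIM, STYLE_DIM = 8, 3, 2

# Learnable lookup tables (random stand-ins for trained embeddings).
speaker_table = rng.normal(size=(NUM_SPEAKERS, SPK_DIM))
style_table = rng.normal(size=(NUM_STYLES, STYLE_DIM))

def condition_encoder_output(text_enc, speaker_id, style_id):
    """Broadcast the speaker and style embeddings across the encoder
    time axis and concatenate them onto each frame, a common way to
    condition a shared TTS decoder on speaker and style."""
    t = text_enc.shape[0]
    spk = np.tile(speaker_table[speaker_id], (t, 1))
    sty = np.tile(style_table[style_id], (t, 1))
    return np.concatenate([text_enc, spk, sty], axis=-1)

text_enc = rng.normal(size=(5, TEXT_DIM))   # 5 encoder frames
cond = condition_encoder_output(text_enc, speaker_id=1, style_id=0)
print(cond.shape)  # (5, 13): 8 text + 3 speaker + 2 style dims
```

Because speaker and style are separate inputs, any (speaker, style) pair can be requested at synthesis time, which is what makes cross-speaker style transfer possible even when that speaker never recorded that style.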
