
Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…
