Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…
