Improving the Quality of Neural TTS Using Long-form Content and Multi-speaker Multi-style Modeling

Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the…