Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
The rapid progress of foundation models and large language models (LLMs) has fueled significant improvements in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the …