
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over…
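To make "low rank adaptation" concrete, here is a minimal sketch of the generic idea in plain Python. It is not the FLoRA method itself, whose details the abstract does not give. A frozen weight matrix W is augmented with a trainable low-rank product B·A, so only the small matrices A and B would be updated during adaptation. The class name `LoRALinear`, the helper `matvec`, and the parameters `rank` and `alpha` are illustrative assumptions, not names from the paper.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class LoRALinear:
    """Generic low-rank adapter sketch (hypothetical; not the paper's FLoRA)."""

    def __init__(self, in_dim, out_dim, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        # W stands in for a pre-trained weight; it stays frozen.
        self.W = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
        # A and B are the only trainable parameters.
        self.A = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        # B starts at zero, so the adapted layer initially equals the frozen one.
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

layer = LoRALinear(in_dim=6, out_dim=3, rank=2)
x = [1.0] * 6
y = layer.forward(x)
```

With `in_dim=6`, `out_dim=3`, and `rank=2`, the adapter trains 18 parameters, the same as the 18 entries of the frozen weight at this toy size. The savings appear at realistic layer widths, where the rank is far smaller than either dimension.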
