Categories: FAANG

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves 22% relative reduction in equal error rate (EER) over…
AI Generated Robotic Content

Recent Posts

We may have a new SOTA open-source model: ERNIE-Image Comparisons

Base model is definitely SOTA, can even easily compete with closed-source ones in terms of…

1 hour ago

Navigating the generative AI journey: The Path-to-Value framework from AWS

Generative AI is reshaping how organizations approach productivity, customer experiences, and operational capabilities. Across industries,…

1 hour ago

The Surprising MacBook Neo Competitor You’ve Never Heard Of

In many ways, the HP OmniBook 5 is a better budget laptop than the MacBook…

2 hours ago

Tiny cameras in earbuds let users talk with AI about what they see

University of Washington researchers developed the first system that incorporates tiny cameras in off-the-shelf wireless…

2 hours ago

Update: Distilled v1.1 is live

We've pushed an LTX-2.3 update today. The Distilled model has been retrained (now v1.1) with…

1 day ago

How to Implement Tool Calling with Gemma 4 and Python

The open-weights model ecosystem shifted recently with the release of the

1 day ago