
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low-rank adaptation. On device-directed speech detection, FLoRA enables the multimodal LLM to achieve a 22% relative reduction in equal error rate (EER) over…
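The core idea described above, adapting a frozen unimodal model to a new modality through a low-rank update, can be sketched in a few lines. The paper's exact architecture is not given here, so the shapes, the audio projector `P_audio`, and the function `flora_forward` below are all illustrative assumptions: a frozen weight matrix stands in for a pre-trained LLM layer, an audio feature is projected into the text hidden space, and a LoRA-style residual `B @ (A @ x)` supplies the trainable adaptation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_audio, rank = 16, 8, 4  # toy dimensions, not the paper's

# Frozen pre-trained projection (stands in for one LLM weight matrix).
W = rng.normal(size=(d_model, d_model))

# Low-rank adapter delta_W = B @ A; B starts at zero so the adapted
# model initially behaves exactly like the frozen one (standard LoRA init).
A = rng.normal(size=(rank, d_model)) * 0.01
B = np.zeros((d_model, rank))

# Hypothetical fusion projector mapping audio features into the text space.
P_audio = rng.normal(size=(d_model, d_audio)) * 0.01

def flora_forward(x_text, x_audio):
    """Frozen weight plus low-rank update, applied to the fused input."""
    fused = x_text + P_audio @ x_audio      # inject the new modality
    return W @ fused + B @ (A @ fused)      # LoRA-style residual update

x_text = rng.normal(size=d_model)
x_audio = rng.normal(size=d_audio)
y = flora_forward(x_text, x_audio)
```

Only `A`, `B`, and `P_audio` would be trained in this sketch, which is what makes the adaptation parameter-efficient: the pre-trained weights stay untouched.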
