Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low-rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over…
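The excerpt above does not detail FLoRA's exact architecture, but the core idea it names, adapting a frozen pre-trained model with trainable low-rank factors, can be sketched generically. The snippet below shows a minimal low-rank adapter applied to a projection from a new modality (e.g., audio) into an LLM's hidden space; all dimensions, names, and the zero-initialization choice are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only (not the paper's actual sizes)
d_audio, d_model, rank = 16, 32, 4

# Frozen projection from an audio encoder into the LLM hidden space
W = rng.normal(size=(d_model, d_audio))

# Trainable low-rank factors. B starts at zero so the adapter is
# initially a no-op, as in standard LoRA initialization.
A = rng.normal(scale=0.01, size=(rank, d_audio))
B = np.zeros((d_model, rank))
alpha = 8.0  # scaling hyperparameter

def adapted_project(x_audio):
    """Frozen projection plus low-rank update:
    y = W x + (alpha / rank) * B (A x).
    Only A and B would be trained; W stays frozen."""
    return W @ x_audio + (alpha / rank) * (B @ (A @ x_audio))

x = rng.normal(size=d_audio)
y = adapted_project(x)
# With B = 0 the adapter contributes nothing, so output equals W @ x
assert np.allclose(y, W @ x)
```

Because only the rank-`r` factors are trained, the number of new parameters grows with `rank * (d_audio + d_model)` rather than `d_audio * d_model`, which is what makes this style of adaptation cheap enough to bolt a new modality onto a pre-trained model.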