
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over…
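The core idea of a low-rank fusion adapter can be sketched in a few lines: a frozen projection maps features from a new modality (e.g. audio) into the LLM embedding space, and only a small pair of low-rank matrices is trained on top of it. The sketch below is a minimal illustration, not the paper's implementation; the dimensions, the `fused_projection` function, and the zero-initialization convention are assumptions borrowed from standard LoRA practice.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_audio, rank, alpha = 64, 32, 4, 8.0

# Frozen projection from audio features into the LLM embedding space
# (hypothetical stand-in for a pre-trained fusion layer).
W = rng.normal(scale=0.02, size=(d_model, d_audio))

# Trainable low-rank adapter: only A and B are updated during fusion
# fine-tuning, so new parameters number rank * (d_model + d_audio)
# instead of d_model * d_audio.
A = rng.normal(scale=0.01, size=(rank, d_audio))
B = np.zeros((d_model, rank))  # zero-init so the adapter starts as a no-op

def fused_projection(x_audio: np.ndarray) -> np.ndarray:
    """LoRA-style update: W x + (alpha / rank) * B A x."""
    return W @ x_audio + (alpha / rank) * (B @ (A @ x_audio))

x = rng.normal(size=d_audio)
h = fused_projection(x)
# With B zero-initialized, the adapter contributes nothing before
# training, so the fused output equals the frozen projection.
assert np.allclose(h, W @ x)
```

Because `B` starts at zero, the adapted model initially behaves exactly like the unimodal base model, which is one reason LoRA-style adaptation tends to train stably on a new modality.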
