Categories: FAANG

Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over…
AI Generated Robotic Content
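
The excerpt describes FLoRA only at a high level, so the paper's exact architecture is not shown here. Below is a minimal, illustrative PyTorch sketch of the general idea: a frozen linear layer augmented with a trainable low-rank update (standard LoRA), used to project features from a new modality into the LLM's token-embedding space. All names (LoRALinear, FusionAdapter, d_audio, d_model, r, alpha) are hypothetical, and the base layer stands in for a pre-trained weight that would be loaded in practice.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x)), with A: d_in -> r and B: r -> d_out."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Stand-in for a pre-trained projection; in practice its weights
        # would be loaded from the base model and kept frozen.
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(d_in, r, bias=False)   # low-rank down-projection
        self.lora_b = nn.Linear(r, d_out, bias=False)  # low-rank up-projection
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero, so the
        self.scale = alpha / r              # adapted model initially matches the base

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class FusionAdapter(nn.Module):
    """Maps features of a new modality (e.g. audio frames) into the LLM's
    embedding space so the frozen text LLM can consume them as prefix tokens."""
    def __init__(self, d_audio: int, d_model: int, r: int = 8):
        super().__init__()
        self.proj = LoRALinear(d_audio, d_model, r=r)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, n_frames, d_audio) -> (batch, n_frames, d_model)
        return self.proj(audio_feats)

# Usage: project audio features, then concatenate with token embeddings
# before feeding the frozen LLM; only the low-rank matrices are trained.
adapter = FusionAdapter(d_audio=512, d_model=2048, r=8)
audio = torch.randn(2, 50, 512)
prefix = adapter(audio)  # (2, 50, 2048)
```

Only the two small low-rank matrices receive gradients, which is what makes this style of adaptation parameter-efficient compared with pre-training a multimodal LLM from scratch.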
