SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for…
AI Generated Robotic Content
