
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for…
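The abstract names the two-stream SlowFast mechanism but gives no implementation details. As a rough illustration only (not the paper's actual method), the PyTorch-style sketch below shows the general idea behind such token-efficient pooling: a temporally sparse but spatially dense "slow" stream combined with a temporally dense but spatially pooled "fast" stream. The function name `slowfast_tokens`, the stride of 4, and the 4×4 average pooling are assumptions, not SF-LLaVA-1.5's configuration.

```python
import torch
import torch.nn.functional as F


def slowfast_tokens(frame_features: torch.Tensor,
                    slow_stride: int = 4,
                    fast_pool: int = 4) -> torch.Tensor:
    """Illustrative two-stream SlowFast token aggregation (hypothetical).

    frame_features: (T, H, W, C) visual features for T sampled frames.
    Slow pathway: keep every `slow_stride`-th frame at full spatial
    resolution. Fast pathway: keep all frames but pool spatial tokens
    aggressively. Both streams are flattened and concatenated into a
    single token sequence for the LLM.
    """
    T, H, W, C = frame_features.shape

    # Slow pathway: temporally sparse, spatially dense.
    slow = frame_features[::slow_stride]                  # (T//stride, H, W, C)
    slow_tokens = slow.reshape(-1, C)

    # Fast pathway: temporally dense, spatially pooled.
    fast = frame_features.permute(0, 3, 1, 2)             # (T, C, H, W)
    fast = F.avg_pool2d(fast, kernel_size=fast_pool)      # (T, C, H/p, W/p)
    fast_tokens = fast.permute(0, 2, 3, 1).reshape(-1, C)

    # Concatenate both streams into one token-efficient sequence.
    return torch.cat([slow_tokens, fast_tokens], dim=0)
```

With these illustrative settings, 32 frames of 24×24 feature maps would produce 8×576 slow tokens plus 32×36 fast tokens (5,760 total) instead of the 18,432 tokens needed to keep every frame at full resolution, which is the kind of budget reduction the abstract's "token-efficient" claim refers to.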