Categories: FAANG

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for…
AI Generated Robotic Content

Recent Posts

What tools would you use to make morphing videos like this?

submitted by /u/nikitagent [link] [comments]

4 hours ago

Bias after Prompting: Persistent Discrimination in Large Language Models

A dangerous assumption that can be made from prior work on the bias transfer hypothesis…

4 hours ago

Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning

Author: Keertana Chidambaram, Qiuling Xu, Ko-Jen Hsiao, Moumita Bhattacharya(*The work was done when Keertana interned…

4 hours ago

When your AI browser becomes your enemy: The Comet security disaster

Remember when browsers were simple? You clicked a link, a page loaded, maybe you filled…

5 hours ago

Baseus Inspire XC1 Review: Excellent Open Earbuds

These affordable open buds come with Bose-crafted sound.

5 hours ago

DeepMind introduces AI agent that learns to complete various tasks in a scalable world model

Over the past decade, deep learning has transformed how artificial intelligence (AI) agents perceive and…

5 hours ago