4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we significantly expand upon the capabilities of 4M by training it on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from…
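The core idea behind 4M-style any-to-any training is to map every modality (RGB, depth, semantics, captions, and so on) into a shared discrete token space, then randomly split the tokenized modalities of each sample into inputs and prediction targets, so that any modality can serve as either. Below is a minimal, hypothetical PyTorch sketch of that scheme; all class names, shapes, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Sketch of any-to-any masked multimodal training (names are hypothetical).
# Assumes each modality has already been tokenized into a shared discrete vocabulary.

import random
import torch
import torch.nn as nn

class AnyToAnyModel(nn.Module):
    """Toy encoder-decoder over a shared discrete token vocabulary."""
    def __init__(self, vocab_size=4096, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.transformer = nn.Transformer(
            d_model=dim, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # Encode the input-modality tokens, decode the target-modality tokens.
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        return self.head(self.transformer(src, tgt))

def sample_input_target_split(modalities, input_budget=2):
    """Randomly choose which tokenized modalities are inputs vs. targets."""
    names = list(modalities)
    random.shuffle(names)
    inputs = {n: modalities[n] for n in names[:input_budget]}
    targets = {n: modalities[n] for n in names[input_budget:]}
    return inputs, targets

# Hypothetical usage: three modalities pre-tokenized to id tensors of shape
# (batch, seq_len); the input/target split is re-sampled every training step.
modalities = {
    "rgb":   torch.randint(0, 4096, (1, 16)),
    "depth": torch.randint(0, 4096, (1, 16)),
    "text":  torch.randint(0, 4096, (1, 8)),
}
inputs, targets = sample_input_target_split(modalities)
src = torch.cat(list(inputs.values()), dim=1)
tgt = torch.cat(list(targets.values()), dim=1)
logits = AnyToAnyModel()(src, tgt)  # (1, tgt_len, vocab_size)
```

Because the split is re-sampled each step, the same network learns every modality-to-modality mapping, which is what makes the model "any-to-any" rather than tied to a fixed input/output pair.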