
DataComp: In Search of the Next Generation of Multimodal Datasets

Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training…
