Adaptive Training Distributions with Scalable Online Bilevel Optimization

Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work considers modifying the pretraining distribution in the case where one has a small sample of data reflecting the targeted test conditions. We propose an algorithm motivated by a recent formulation of this setting as an online, bilevel optimization problem. With scalability in mind, our algorithm prioritizes computing gradients at training points which are likely to…
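For readers unfamiliar with the term, a bilevel formulation of this setting can be sketched generically as follows. This is a common textbook-style form of data reweighting, not necessarily the exact objective used in this work; the symbols $D_{\text{train}}$, $D_{\text{target}}$, $w_\alpha$, $\theta$, $\alpha$, $\ell$, $\eta$, and $B_t$ are illustrative assumptions that do not appear in the abstract.

```latex
% Outer problem: choose the weighting parameters \alpha so that the model
% trained under the reweighted distribution does well on the small target sample.
\min_{\alpha} \;
  \frac{1}{|D_{\text{target}}|} \sum_{x \in D_{\text{target}}}
  \ell\bigl(x;\, \theta^{*}(\alpha)\bigr)

% Inner problem: the model parameters minimize the weighted pretraining loss.
\text{s.t.} \quad
  \theta^{*}(\alpha) \in \arg\min_{\theta}
  \sum_{x \in D_{\text{train}}} w_{\alpha}(x)\, \ell(x;\, \theta)

% Online variant (assumed): rather than solving the inner problem to
% convergence, take one weighted gradient step on a minibatch B_t,
% then update \alpha using an estimate of the outer gradient.
\theta_{t+1} = \theta_{t} - \eta \, \nabla_{\theta}
  \sum_{x \in B_t} w_{\alpha_t}(x)\, \ell(x;\, \theta_{t})
```

In this sketch the weights $w_\alpha(x)$ play the role of the modified pretraining distribution, and the small target sample enters only through the outer loss.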
