TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

This paper was accepted to the ACL 2025 main conference as an oral presentation, and was also accepted at the Scalable Continual Learning for Lifelong Foundation Models (SCLLFM) Workshop at NeurIPS 2024.
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) – orders of magnitude larger than previous continual language modeling benchmarks. We also…
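
The setup sketched in the abstract boils down to a loop over chronologically ordered Common Crawl snapshots: update the model from its previous checkpoint on each new dump, then score it on held-out data from every time period. Below is a minimal Python sketch of that loop, assuming hypothetical helpers (update_fn, the continual-update method under comparison, and evaluate_fn, a held-out metric such as perplexity); none of these names come from the paper itself.

def time_continual_pretrain(model, dumps, eval_sets, update_fn, evaluate_fn):
    """Sequentially update `model` on chronologically ordered CC dumps.

    dumps       -- list of (timestamp, dataset) pairs, in time order
    eval_sets   -- dict: timestamp -> held-out eval data for that period
    update_fn   -- continual-update method under study (replay, LR re-warming, ...)
    evaluate_fn -- returns a scalar metric (e.g., perplexity) for model on data
    """
    history = []
    for timestamp, dump in dumps:
        # Continue from the previous checkpoint rather than retraining from scratch.
        model = update_fn(model, dump)

        # Score against every period's held-out set: earlier periods expose
        # forgetting, later periods show how stale the current model is.
        scores = {t: evaluate_fn(model, data) for t, data in eval_sets.items()}
        history.append((timestamp, scores))
    return model, history

Tracking the full matrix of (training time, evaluation time) scores, rather than a single aggregate number, is what lets a benchmark of this kind separate forgetting of older data from staleness on newer data.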