TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

This paper was accepted to the ACL 2025 main conference as an oral presentation. It was also accepted at the Scalable Continual Learning for Lifelong Foundation Models (SCLLFM) Workshop at NeurIPS 2024.
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) – orders of magnitude larger than previous continual language modeling benchmarks. We also…
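The core of a time-continual setup like this is a matrix of (training month, evaluation month) scores: after each update on a new Common Crawl dump, the current checkpoint is scored on held-out data from every month, so off-diagonal entries expose forgetting of older months and lag on newer ones. The sketch below is a minimal, hypothetical illustration of that protocol only; the toy unigram model, the made-up monthly corpora, and all names (UnigramLM, monthly_corpora) are stand-ins invented here, not anything from the paper or dataset.

```python
import math
from collections import Counter

# Hypothetical monthly corpora standing in for Common Crawl dumps;
# in the real benchmark each entry would be a web-scale CC snapshot.
monthly_corpora = {
    "2019-05": "old web text about early topics".split(),
    "2021-08": "newer web text mentioning emerging topics".split(),
    "2024-01": "recent web text about the latest topics".split(),
}

class UnigramLM:
    """Toy unigram LM used only to illustrate continual updates."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, tokens):
        # "Continual pretraining" here is just accumulating counts
        # on top of whatever earlier months already contributed.
        self.counts.update(tokens)
        self.total += len(tokens)

    def perplexity(self, tokens, vocab_size=50_000):
        # Add-one smoothing keeps unseen tokens finite.
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + vocab_size))
            for t in tokens
        )
        return math.exp(-log_prob / max(len(tokens), 1))

# Time-continual protocol: after each monthly update, evaluate the
# current checkpoint on data from *every* month, filling one row of
# the (training month, evaluation month) matrix.
model = UnigramLM()
for train_month in sorted(monthly_corpora):
    model.update(monthly_corpora[train_month])
    for eval_month in sorted(monthly_corpora):
        ppl = model.perplexity(monthly_corpora[eval_month])
        print(f"trained through {train_month} | eval {eval_month}: ppl={ppl:.1f}")
```

With a real LLM, each loop iteration would be a full pretraining phase on one CC dump and each evaluation a held-out perplexity run, but the shape of the resulting matrix, and why it reveals both forgetting and staleness, is the same.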