TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

This paper was accepted to the ACL 2025 main conference as an oral presentation.
This paper was accepted at the Scalable Continual Learning for Lifelong Foundation Models (SCLLFM) Workshop at NeurIPS 2024.
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) – orders of magnitude larger than previous continual language modeling benchmarks. We also…