TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

This paper was accepted to the ACL 2025 main conference as an oral presentation.
This paper was accepted at the Scalable Continual Learning for Lifelong Foundation Models (SCLLFM) Workshop at NeurIPS 2024.
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) – orders of magnitude larger than previous continual language modeling benchmarks. We also…
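
As a rough mental model of the setup this abstract describes (not the paper's code or data), the sketch below loops over chronologically ordered Common Crawl dumps, updates a stand-in "model" on each, and scores it on held-out splits of every dump seen so far; all functions, metrics, and dump identifiers are hypothetical placeholders.

```python
# Purely illustrative sketch of time-continual pretraining and evaluation over
# chronologically ordered Common Crawl dumps. Every function and identifier here
# is a hypothetical placeholder, not the TiC-LM code, data, or metric.

def load_dump(dump_id):
    """Stand-in loader: returns (train_docs, heldout_docs) for one dump."""
    return [f"{dump_id}/train-doc"], [f"{dump_id}/heldout-doc"]

def update_model(model, train_docs):
    """Stand-in update step (continued pretraining, replay, etc.)."""
    return model | {doc.split("/")[0] for doc in train_docs}

def evaluate(model, heldout_docs):
    """Stand-in metric: fraction of held-out docs whose dump the model has seen."""
    return sum(doc.split("/")[0] in model for doc in heldout_docs) / len(heldout_docs)

dumps = ["2013-20", "2016-30", "2019-35", "2024-10"]   # illustrative dump ids
model, history = set(), {}
for t, dump in enumerate(dumps):
    train_docs, _ = load_dump(dump)
    model = update_model(model, train_docs)
    # Score held-out splits of every dump up to time t: older dumps probe
    # retention, the newest dump probes adaptation to fresh data.
    history[dump] = {d: evaluate(model, load_dump(d)[1]) for d in dumps[: t + 1]}
print(history)
```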

Recent Posts

Flux Kontext Zoom Out LoRA

https://civitai.com/models/1800528?modelVersionId=2037657
https://huggingface.co/reverentelusarca/flux-kontext-zoom-out-lora
Submitted by /u/sktksm

Zero-Shot and Few-Shot Classification with Scikit-LLM

In this article, you will learn: • how Scikit-LLM integrates large language models like OpenAI's…
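
As a small illustration of the zero-shot workflow such an article typically covers, the sketch below uses scikit-llm's ZeroShotGPTClassifier; the import paths, model identifier, and fit(None, labels) usage are assumptions based on recent scikit-llm releases rather than the article's own code, so check the version you have installed.

```python
# Zero-shot text classification sketch with the scikit-llm package.
# Import paths, the model identifier, and fit(None, labels) are assumptions
# about recent scikit-llm releases; verify against your installed version.
from skllm.config import SKLLMConfig
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

SKLLMConfig.set_openai_key("sk-...")  # placeholder OpenAI key

clf = ZeroShotGPTClassifier(model="gpt-4o-mini")       # assumed model name
clf.fit(None, ["positive", "negative", "neutral"])     # only candidate labels, no training data

reviews = [
    "Setup took five minutes and it works flawlessly.",
    "The battery died after two days and support never replied.",
]
print(clf.predict(reviews))
```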

Building a Plain Seq2Seq Model for Language Translation

This post is divided into five parts; they are: • Preparing the Dataset for Training…
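
For orientation, here is a generic encoder-decoder sketch in PyTorch of the kind a plain seq2seq translation model is built from; the GRU choice, layer sizes, and toy data are illustrative assumptions, not the post's actual code.

```python
# Generic encoder-decoder (seq2seq) sketch in PyTorch; sizes and data are toy choices.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len) token ids
        _, hidden = self.rnn(self.embed(src))
        return hidden                              # (1, batch, hid_dim) source summary

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, hidden):                # tgt: (batch, tgt_len) shifted target ids
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden            # logits: (batch, tgt_len, vocab_size)

# Toy forward pass and loss on random "sentences" (teacher forcing).
src_vocab, tgt_vocab = 1000, 1200
encoder, decoder = Encoder(src_vocab), Decoder(tgt_vocab)
src = torch.randint(0, src_vocab, (2, 5))          # 2 source sentences, 5 tokens each
tgt = torch.randint(0, tgt_vocab, (2, 6))          # 2 target sentences, 6 tokens each
logits, _ = decoder(tgt[:, :-1], encoder(src))     # feed shifted targets to the decoder
loss = nn.CrossEntropyLoss()(logits.reshape(-1, tgt_vocab), tgt[:, 1:].reshape(-1))
print(loss.item())
```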

Synthetic Dataset Generation with Faker

In this article, you will learn: • how to use the Faker library in Python…
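
As a minimal sketch of the kind of usage the article describes, the snippet below generates a small synthetic table with Faker; the column choices and seed are illustrative, not taken from the article.

```python
# Small sketch of generating a synthetic tabular dataset with Faker.
# Column choices and the seed are illustrative assumptions.
from faker import Faker

fake = Faker()
Faker.seed(42)  # make the fake data reproducible

rows = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today"),
    }
    for _ in range(5)
]

for row in rows:
    print(row)
```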

Gemini 2.5 Flash-Lite is now ready for scaled production use

Gemini 2.5 Flash-Lite, previously in preview, is now stable and generally available. This cost-efficient model…
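
For context, a minimal call to the generally available model might look like the sketch below, assuming the google-genai Python SDK and the gemini-2.5-flash-lite model identifier; both are assumptions about current naming rather than details from the announcement.

```python
# Minimal sketch of calling Gemini 2.5 Flash-Lite via the google-genai SDK.
# The SDK usage and model identifier are assumptions; consult the official docs.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents="Summarize continual pretraining of LLMs in one sentence.",
)
print(response.text)
```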

Beyond accelerators: Lessons from building foundation models on AWS with Japan’s GENIAC program

In 2024, the Ministry of Economy, Trade and Industry (METI) launched the Generative AI Accelerator…
