
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

This paper has been accepted at the Data Problems for Foundation Models workshop at ICLR 2024.
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and because of the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training…
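The abstract is cut off just as it names the proposed method, Web Rephrase Augmented Pre-training (WRAP). Per the full paper, the core recipe is to rephrase noisy web documents with an off-the-shelf instruction-tuned model into cleaner styles, then pre-train on the mix of real and rephrased text. A minimal sketch of that rephrasing step, assuming the Hugging Face `transformers` pipeline; the model choice, prompt wording, and style list here are illustrative assumptions, not the paper's exact setup:

```python
# Minimal sketch of WRAP-style rephrasing, under the assumptions above.
from transformers import pipeline

# Hypothetical rephrasing model; the recipe only requires a capable
# instruction-tuned LLM, so any such model could slot in here.
rephraser = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
)

# Illustrative rephrasing styles (prompt wording is an assumption).
STYLES = {
    "wikipedia": "Rewrite the following passage in a clear, encyclopedic style:",
    "qa": "Convert the following passage into a question-and-answer format:",
}

def rephrase(document: str, style: str = "wikipedia",
             max_new_tokens: int = 512) -> str:
    """Return a synthetic rephrasing of a noisy web document."""
    prompt = f"{STYLES[style]}\n\n{document}\n\nRewritten passage:"
    out = rephraser(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only
    # the newly generated text.
    return out[0]["generated_text"][len(prompt):].strip()

# Pre-training would then draw from both the raw corpus and its
# rephrasings, e.g.: corpus = raw_docs + [rephrase(d) for d in raw_docs]
```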