
Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks

Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP. Independently, these methods reduce model size and can accelerate inference, but their relative benefit and combinatorial interactions have not been rigorously studied. For each of the eight possible subsets of these techniques, we compare accuracy vs. model size tradeoffs across six BERT architecture sizes and eight GLUE tasks. We find that quantization and distillation consistently provide greater benefit than pruning. Surprisingly, except for the pair of…
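For readers unfamiliar with how these techniques compose in practice, the sketch below is an illustrative assumption rather than the paper's code: it combines two of the studied methods, magnitude pruning and post-training dynamic quantization, on a small stand-in PyTorch model. The layer sizes, sparsity level, and quantization settings are placeholders, not values from the study.

```python
# Minimal sketch: combining magnitude pruning with dynamic quantization.
# This is NOT the paper's implementation; shapes and hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a BERT-style classifier head; the paper evaluates full
# BERT architectures across GLUE tasks.
model = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.Linear(768, 2),
)

# 1) Magnitude pruning: zero out the 30% of weights with the smallest |w|.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

# 2) Post-training dynamic quantization: store Linear weights as int8 and
#    quantize activations on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)  # Linear layers are now dynamically quantized modules
```

In this sketch the pruning mask is made permanent with `prune.remove` before quantization, so the sparsified weights are what get converted to int8; knowledge distillation, the third technique in the study, would be applied earlier, during training of the smaller model.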