Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks
Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP. Independently, these methods reduce model size and can accelerate inference, but their relative benefit and combinatorial interactions have not been rigorously studied. For each of the eight possible subsets of these techniques, we compare accuracy vs. …
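To make the combinations concrete: with three techniques, each either applied or not, there are 2³ = 8 subsets. Below is a minimal sketch, assuming PyTorch, of composing two of them (magnitude pruning followed by post-training dynamic quantization) on a toy model. The layer sizes and the 30% sparsity level are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for an NLP classifier head; sizes are illustrative.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

# Magnitude pruning: zero out the 30% of weights with smallest |w|.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the pruning mask into the tensor

# Post-training dynamic quantization: int8 weights for Linear layers.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)
```

Knowledge distillation would be applied separately at training time (a student optimized against a teacher's soft targets), so it composes with either or both of the post-training steps above.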