Compute-Optimal Quantization-Aware Training

Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the…
