
Exclusive Self Attention

We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves the Transformer's sequence-modeling performance. The key idea is to constrain attention so that each token captures only information orthogonal to its own value vector (thereby excluding what is already encoded at its own position), encouraging better context modeling. Evaluated on standard language modeling, XSA consistently outperforms SA across model sizes up to 2.7B parameters, with increasingly large gains as sequence length grows.
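The abstract does not spell out the mechanism, so below is a minimal PyTorch sketch of one plausible reading: compute ordinary causal self attention, then project each token's attention output onto the subspace orthogonal to that token's own value vector. The class name `ExclusiveSelfAttention`, the per-head projection, and the placement of the projection after the softmax aggregation are all assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExclusiveSelfAttention(nn.Module):
    """Hypothetical sketch of exclusive self attention (XSA).

    After standard attention, each token's output is projected onto the
    subspace orthogonal to that token's own value vector, so the token
    aggregates only information it does not already carry itself.
    (Assumed formulation; see the paper for the actual definition.)
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, d_head)
        q, k, v = (
            z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            for z in (q, k, v)
        )

        # standard causal scaled dot-product attention
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # exclusive step: remove the component of each token's attention
        # output that lies along that token's own value vector v_i
        v_norm_sq = (v * v).sum(dim=-1, keepdim=True).clamp_min(1e-6)
        coeff = (attn_out * v).sum(dim=-1, keepdim=True) / v_norm_sq
        attn_out = attn_out - coeff * v

        attn_out = attn_out.transpose(1, 2).reshape(b, t, -1)
        return self.out(attn_out)
```

Under this reading, the projection is what enforces the "orthogonal to the token's own value vector" constraint: whatever the softmax mixture returns, its component along v_i is subtracted out, so the residual stream receives only information beyond what the token itself contributes.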