Massive Activations in Large Language Models
TL;DR: Large Language Models contain a handful of massive activations whose values stay nearly constant across inputs and that steer attention probabilities. The same phenomenon appears in Vision Transformers.
Summary:
- Massive activations are observed in Large Language Models (LLMs) and Vision Transformers (ViTs).
- These activations are orders of magnitude larger than typical activations and are largely input-agnostic; a detection sketch follows this list.
- They function as crucial bias terms: attention concentrates on the tokens that carry them, injecting an implicit, input-independent bias into the self-attention output.
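To make the detection criterion concrete, here is a minimal PyTorch sketch (not the authors' code) that flags massive activations in one layer's hidden states. The thresholds follow the paper's rough rule of thumb (magnitude above 100 and at least ~1,000x the median magnitude in that hidden state); the function name, tensor shapes, and toy input are illustrative assumptions.

```python
import torch

def find_massive_activations(hidden: torch.Tensor,
                             abs_threshold: float = 100.0,
                             ratio_threshold: float = 1000.0):
    """Flag massive activations in one layer's hidden states.

    hidden: (seq_len, hidden_dim) activations from a single layer.
    Returns (token_idx, dim_idx, value) triples for entries whose
    magnitude exceeds abs_threshold AND ratio_threshold times the
    median magnitude -- an approximation of the paper's criterion.
    """
    mags = hidden.abs()
    median_mag = mags.median()
    mask = (mags > abs_threshold) & (mags > ratio_threshold * median_mag)
    rows, cols = mask.nonzero(as_tuple=True)
    return [(int(r), int(c), float(hidden[r, c])) for r, c in zip(rows, cols)]

# Toy check: plant one artificially huge entry on the first token.
h = torch.randn(8, 16)
h[0, 3] = 2000.0
print(find_massive_activations(h))  # -> [(0, 3, 2000.0)]
```

In practice one would run such a check over every layer's hidden states (e.g., captured with forward hooks) and observe that the flagged entries sit on a few fixed feature dimensions and tokens, regardless of the input.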
Major Findings:
- Massive activations are present in various LLMs and ViTs, spanning different model sizes and families.
- These activations remain largely constant regardless of the input and function as indispensable bias terms in LLMs and ViTs.
- Massive activations concentrate attention probabilities on their corresponding tokens, which in turn introduces implicit bias terms into the self-attention output (illustrated in the sketch below).
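The attention-concentration claim can be illustrated with a small NumPy sketch (toy numbers, not the paper's experiment). Bumping one token's attention logits, which mimics the outsized logits that massive-activation tokens receive, drives each query's attention mass onto that token, so the self-attention output collapses to that token's value vector, i.e., a constant additive bias shared across positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d = 6, 4
q = rng.normal(size=(seq, d))
k = rng.normal(size=(seq, d))
v = rng.normal(size=(seq, d))

logits = q @ k.T / np.sqrt(d)
logits[:, 0] += 25.0  # mimic the dominant logit on the massive-activation token

attn = softmax(logits)   # every row is ~one-hot on token 0
out = attn @ v           # every output row is ~v[0]

print(attn[:, 0])                          # attention mass on token 0, ~1.0 per query
print(np.allclose(out, v[0], atol=1e-6))   # output reduces to a constant bias v[0]
```

Because the attended token's hidden state barely changes with the input, the contribution it makes is effectively input-independent, which is why the paper characterizes it as an implicit bias in self-attention.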
Analysis and Critique:
- The study offers valuable insight into the internal mechanisms of LLMs and ViTs by documenting both the presence and the functional role of massive activations.
- The findings clarify how attention mechanisms acquire implicit biases, deepening our understanding of these models' internal workings.
- Further research is needed to explore the impact of massive activations on model performance and to investigate potential applications of these findings in real-world scenarios.
Appendix
| Field | Value |
|---|---|
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-28 |
| Abstract | https://arxiv.org/abs/2402.17762v1 |
| HTML | https://browse.arxiv.org/html/2402.17762v1 |
| Truncated | False |
| Word Count | 11591 |