Massive Activations in Large Language Models
TL;DR: Large Language Models contain a handful of massive activations whose values stay nearly constant across inputs and that steer attention probabilities. The same phenomenon appears in Vision Transformers.
Summary:
- Massive activations are observed in Large Language Models (LLMs) and Vision Transformers (ViTs).
- These activations are orders of magnitude larger than typical activations and are largely input-agnostic; a detection sketch follows this list.
- They function as crucial bias terms: attention concentrates on the tokens that carry them, injecting an implicit, input-independent bias into the self-attention output.
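To make the detection criterion concrete, here is a minimal PyTorch sketch (not the authors' code) that flags massive activations in one layer's hidden states. The thresholds follow the paper's rough rule of thumb (magnitude above 100 and at least ~1,000x the median magnitude in that hidden state); the function name, tensor shapes, and toy input are illustrative assumptions.

```python
import torch

def find_massive_activations(hidden: torch.Tensor,
                             abs_threshold: float = 100.0,
                             ratio_threshold: float = 1000.0):
    """Flag massive activations in one layer's hidden states.

    hidden: (seq_len, hidden_dim) activations from a single layer.
    Returns (token_idx, dim_idx, value) triples for entries whose
    magnitude exceeds abs_threshold AND ratio_threshold times the
    median magnitude -- an approximation of the paper's criterion.
    """
    mags = hidden.abs()
    median_mag = mags.median()
    mask = (mags > abs_threshold) & (mags > ratio_threshold * median_mag)
    rows, cols = mask.nonzero(as_tuple=True)
    return [(int(r), int(c), float(hidden[r, c])) for r, c in zip(rows, cols)]

# Toy check: plant one artificially huge entry on the first token.
h = torch.randn(8, 16)
h[0, 3] = 2000.0
print(find_massive_activations(h))  # -> [(0, 3, 2000.0)]
```

In practice one would run such a check over every layer's hidden states (e.g., captured with forward hooks) and observe that the flagged entries sit on a few fixed feature dimensions and tokens, regardless of the input.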
Major Findings:
- Massive activations are present in various LLMs and ViTs, spanning different model sizes and families.
- These activations remain largely constant regardless of the input and function as indispensable bias terms in LLMs and ViTs.
- Massive activations concentrate attention probabilities on their corresponding tokens, which in turn introduces implicit bias terms into the self-attention output (illustrated in the sketch below).
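The attention-concentration claim can be illustrated with a small NumPy sketch (toy numbers, not the paper's experiment). Bumping one token's attention logits, which mimics the outsized logits that massive-activation tokens receive, drives each query's attention mass onto that token, so the self-attention output collapses to that token's value vector, i.e., a constant additive bias shared across positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d = 6, 4
q = rng.normal(size=(seq, d))
k = rng.normal(size=(seq, d))
v = rng.normal(size=(seq, d))

logits = q @ k.T / np.sqrt(d)
logits[:, 0] += 25.0  # mimic the dominant logit on the massive-activation token

attn = softmax(logits)   # every row is ~one-hot on token 0
out = attn @ v           # every output row is ~v[0]

print(attn[:, 0])                          # attention mass on token 0, ~1.0 per query
print(np.allclose(out, v[0], atol=1e-6))   # output reduces to a constant bias v[0]
```

Because the attended token's hidden state barely changes with the input, the contribution it makes is effectively input-independent, which is why the paper characterizes it as an implicit bias in self-attention.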
Analysis and Critique:
- The study offers valuable insight into the internal mechanisms of LLMs and ViTs by documenting both the presence and the functional role of massive activations.
- The findings clarify how attention mechanisms acquire implicit biases, deepening our understanding of these models' internal workings.
- Further research is needed to explore the impact of massive activations on model performance and to investigate potential applications of these findings in real-world scenarios.
Appendix
| Field | Value |
|---|---|
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-28 |
| Abstract | https://arxiv.org/abs/2402.17762v1 |
| HTML | https://browse.arxiv.org/html/2402.17762v1 |
| Truncated | False |
| Word Count | 11591 |