Transformers are Multi-State RNNs
Tags: production, architectures, robustness
TL;DR: Decoder-only transformers can be viewed as infinite multi-state RNNs, and a new compression policy, TOVA, significantly outperforms existing techniques.
Summary of “Transformers are Multi-State RNNs”
Main Findings
- Decoder-only transformers can be viewed as infinite multi-state RNNs (MSRNNs), where the cached key and value vectors form a multi-state that grows unboundedly with the sequence.
- A novel compression policy, TOVA (Token Omission Via Attention), is introduced; it outperforms the baseline policies and can drastically reduce memory consumption during inference.
- Pretrained transformer decoder LLMs often behave in practice as finite MSRNNs, so the cache size can be substantially reduced with negligible performance degradation.
Introduction
- Transformers have replaced RNNs for NLP due to their direct access to each token in a sequence.
Background
RNNs
- RNNs process sequential data recurrently: at each time step, a function receives the current token representation and the hidden state from the previous step and produces a new hidden state.
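In symbols (our reconstruction of this recurrence, so treat the exact notation as an assumption), layer $l$ at time step $t$ computes

$$h_t^{l} = f^{l}\!\left(x_t^{l},\, h_{t-1}^{l}\right),$$

where $x_t^{l}$ is the token representation entering layer $l$, $h_{t-1}^{l}$ is that layer's hidden state from the previous step, and the output $h_t^{l}$ both serves as the new hidden state and feeds the next layer.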
Transformers
- Process sequential data non-recurrently and consist of self-attention and feed-forward mechanisms.
Transformers as Multi-State RNNs
Multi-State RNNs
- Defined as an RNN whose state is a matrix of vectors rather than a single vector, updated by a parameterized function.
Transformers are Infinite MSRNNs
- Transformers can be viewed as an MSRNN, where the number of single-states equals the number of input tokens.
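Under this view, each layer's multi-state is its growing key/value cache. A hedged reconstruction of the per-layer update (our notation, not taken verbatim from the paper):

$$K_t^{l} = \begin{bmatrix} K_{t-1}^{l} \\ k_t^{l} \end{bmatrix}, \quad V_t^{l} = \begin{bmatrix} V_{t-1}^{l} \\ v_t^{l} \end{bmatrix}, \quad h_t^{l} = \mathrm{softmax}\!\left(\frac{q_t^{l} \left(K_t^{l}\right)^{\top}}{\sqrt{d}}\right) V_t^{l}.$$

The pair $(K_t^{l}, V_t^{l})$ gains one single-state per input token, which is why the MSRNN is "infinite" when nothing is ever discarded.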
Converting Pretrained Transformers into Finite MSRNNs
- Finite MSRNNs can be achieved by limiting the number of tokens processed at each step and using various compression policies.
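To make the conversion concrete, here is a minimal PyTorch sketch of a single attention head run as a finite MSRNN with a pluggable compression policy. The helper names (`finite_msrnn_step`, `window_policy`), the unbatched single-head setup, and the sliding-window baseline are illustrative assumptions, not the paper's implementation.

```python
import torch

def window_policy(keys, values, attn, budget):
    """Baseline policy (sketch): keep only the most recent `budget` states."""
    return keys[-budget:], values[-budget:]

def finite_msrnn_step(keys, values, k_t, v_t, q_t, policy, budget):
    """One decoding step of a single attention head viewed as a finite MSRNN
    (hypothetical helper, not the paper's code).

    keys, values : (t-1, d) tensors holding the current multi-state
    k_t, v_t, q_t: (d,) tensors for the new token
    Returns the attention output and the (possibly compressed) multi-state.
    """
    # 1. Grow the multi-state with the new token's key/value.
    keys = torch.cat([keys, k_t[None, :]], dim=0)      # (t, d)
    values = torch.cat([values, v_t[None, :]], dim=0)  # (t, d)

    # 2. Attend from the current query over all retained states.
    attn = torch.softmax(keys @ q_t / keys.shape[-1] ** 0.5, dim=0)  # (t,)
    out = attn @ values                                              # (d,)

    # 3. If the multi-state exceeds its budget, let the policy decide
    #    which states survive to the next step.
    if keys.shape[0] > budget:
        keys, values = policy(keys, values, attn, budget)
    return out, keys, values
```

Running this step token by token with `window_policy` reproduces a plain sliding-window cache; swapping in a different policy changes only step 3.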
Our Proposed Policy: TOVA
- TOVA is a simpler, more powerful MSRNN compression policy that retains the top states based on the attention weights of the last token only.
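Following that description, a TOVA-style policy can be dropped into the sketch above: score each cached state by the attention weight it receives from the last token and keep only the top scorers. This is a sketch of the described behavior, not the authors' code.

```python
import torch

def tova_policy(keys, values, attn, budget):
    """TOVA-style compression (sketch): keep the `budget` states with the
    highest attention weights from the current (last) token; unlike a sliding
    window, a state's recency plays no explicit role in whether it survives."""
    keep = torch.topk(attn, k=budget).indices
    keep, _ = torch.sort(keep)  # restore original token order
    return keys[keep], values[keep]
```

Note that nothing here hard-codes recency; the Analysis section below reports that recent tokens and the very first token tend to be kept anyway.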
Experimental Setup
- Long-range tasks including language modeling, long-range understanding, and text generation were used for evaluation.
Pretrained Transformers Act as Finite MSRNNs
- TOVA outperforms the other policies on language modeling and long-range summarization, and performs well on text generation tasks.
Analysis
- TOVA preserves recent tokens and some older tokens, shows a clear preference for the very first token, and highlights the importance of tokens such as punctuation and proper nouns.
- Using TOVA enables a dramatic increase in the inference batch size.
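As a rough illustration of where the batch-size gain comes from, the snippet below estimates per-sequence KV-cache memory for a hypothetical 7B-scale model (32 layers, hidden size 4096, fp16 cache) when the multi-state is capped at 512 states instead of a full 4096-token context; all of these numbers are assumptions for illustration, not figures from the paper.

```python
layers, hidden, bytes_fp16 = 32, 4096, 2
kv_per_token = 2 * layers * hidden * bytes_fp16   # keys + values across all layers
full_cache = 4096 * kv_per_token                  # uncompressed 4K-token cache
capped_cache = 512 * kv_per_token                 # multi-state capped at 512 states

print(kv_per_token // 1024, "KiB per token")      # 512 KiB
print(full_cache / 2**30, "GiB per sequence")     # 2.0 GiB
print(capped_cache / 2**30, "GiB per sequence")   # 0.25 GiB -> room for ~8x more sequences
```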
Conclusion
- The paper concludes that transformer decoder LLMs often behave as finite MSRNNs and introduces TOVA as a simple compression policy that performs well with minimal memory consumption.
Critique
- The evaluation focuses mainly on English, so the findings may not generalize to languages with different characteristics.
- Evaluating long-text generation is acknowledged to be difficult; it was assessed indirectly using GPT-4, which may not fully capture the quality of the entire text.
Appendix
| Field | Value |
| --- | --- |
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-26 |
| Abstract | http://arxiv.org/abs/2401.06104v1 |
| HTML | https://browse.arxiv.org/html/2401.06104v1 |
| Truncated | False |
| Word Count | 8490 |