Transformers are Multi-State RNNs

Categories: production, architectures, robustness
TL;DR: Transformers can be conceptualized as infinite multi-state RNNs, and a new compression policy, TOVA, significantly outperforms existing techniques.
Authors

Matanel Oren

Michael Hassid

Yossi Adi

Roy Schwartz

Published

January 11, 2024

Summary of “Transformers are Multi-State RNNs”

Main Findings

  1. Decoder-only transformers can be viewed as infinite multi-state RNNs (MSRNNs), in which the cached key and value vectors form a multi-state that grows without bound as tokens are processed.
  2. A novel policy, TOVA (Token Omission Via Attention), is introduced, which outperforms other baseline policies and can drastically reduce the memory consumption during inference.
  3. In practice, pretrained transformer decoder LLMs often behave as finite MSRNNs: the cache size can be substantially reduced with negligible performance degradation.

Introduction

  • Transformers have largely replaced RNNs in NLP, in part because each position has direct access to every token in the sequence rather than a fixed-size recurrent state.

Background

RNNs

  • RNNs process sequential data recurrently: at each time step, a function receives the current token representation and the hidden state from the previous step and produces a new hidden state.
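
To ground the comparison that follows, here is a minimal sketch of this recurrence; the Elman-style update and all names (`rnn_step`, `W_x`, `W_h`) are illustrative rather than the paper's notation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: combine the current token representation x_t with
    the previous hidden state h_prev to produce the new state.
    (Elman-style update, used purely for illustration.)"""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Processing a sequence token by token: the state is a single fixed-size vector.
d = 16
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
h = np.zeros(d)
for x in rng.normal(size=(5, d)):   # 5 token representations
    h = rnn_step(x, h, W_x, W_h, b)
```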

Transformers

  • Transformers process sequential data non-recurrently; each layer consists of a self-attention mechanism followed by a feed-forward network.
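
For contrast, a minimal single-head sketch of causal self-attention (names and shapes are illustrative, not taken from the paper): every position attends directly to all earlier positions, with no fixed-size recurrent bottleneck.

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head causal self-attention over a full sequence: each position
    attends to itself and all previous positions."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

d = 16
rng = np.random.default_rng(0)
X = rng.normal(size=(5, d))                          # 5 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(X, W_q, W_k, W_v)
```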

Transformers as Multi-State RNNs

Multi-State RNNs

  • Defined as an RNN whose state is a matrix of single-states rather than a single vector, with the state update parameterized by a function.

Transformers are Infinite MSRNNs

  • Transformers can be viewed as MSRNNs in which the number of single-states equals the number of input tokens processed so far, so the multi-state grows with every new token.
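
A minimal single-head sketch of this view, with illustrative names: the "multi-state" is the pair of key/value matrices, which gains one row per token and is never pruned, so memory grows linearly with sequence length.

```python
import numpy as np

def msrnn_step(x_t, state, W_q, W_k, W_v):
    """One step of a transformer layer viewed as an MSRNN (illustrative code):
    the multi-state (K, V) grows by one row per token."""
    K, V = state
    q_t = x_t @ W_q
    K = np.vstack([K, x_t @ W_k])          # state grows without bound
    V = np.vstack([V, x_t @ W_v])
    scores = q_t @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    o_t = w @ V
    return o_t, (K, V)

d = 16
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
state = (np.empty((0, d)), np.empty((0, d)))
for x in rng.normal(size=(5, d)):          # decode token by token
    out, state = msrnn_step(x, state, W_q, W_k, W_v)
```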

Converting Pretrained Transformers into Finite MSRNNs

  • A pretrained transformer can be converted into a finite MSRNN by capping the number of key/value states kept in memory and applying a compression policy that decides which states to drop at each decoding step.
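
To make the conversion concrete, here is a single-head sketch under illustrative names, using a simple sliding-window rule (keep only the most recent states) as a stand-in for the compression policy; the paper compares several such policies.

```python
import numpy as np

def window_policy(K, V, cache_size):
    """Illustrative sliding-window rule: keep only the `cache_size` most
    recent key/value states, discarding the oldest."""
    if K.shape[0] <= cache_size:
        return K, V
    return K[-cache_size:], V[-cache_size:]

def finite_msrnn_step(x_t, state, W_q, W_k, W_v, cache_size, policy=window_policy):
    """Same attention step as before, but the multi-state is capped at
    `cache_size` rows by a compression policy applied after each token."""
    K, V = state
    q_t = x_t @ W_q
    K = np.vstack([K, x_t @ W_k])
    V = np.vstack([V, x_t @ W_v])
    scores = q_t @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    o_t = w @ V
    return o_t, policy(K, V, cache_size)

d = 16
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
state = (np.empty((0, d)), np.empty((0, d)))
for x in rng.normal(size=(10, d)):          # decode 10 tokens
    out, state = finite_msrnn_step(x, state, W_q, W_k, W_v, cache_size=4)
assert state[0].shape[0] <= 4               # memory stays bounded
```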

Our Proposed Policy: TOVA

  • TOVA is a simpler yet more powerful MSRNN compression policy: it retains the top states based on the attention weights of the last token only, dropping the lowest-scoring state once the cache is full.
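
A single-head sketch of the policy as summarized above; in the full model it is applied within each layer of a multi-head transformer, which this illustration collapses to one head, and all names are illustrative.

```python
import numpy as np

def tova_step(x_t, state, W_q, W_k, W_v, cache_size):
    """TOVA as summarized above (single-head sketch): append the new key/value
    pair, attend with the current token's query, and if the multi-state now
    exceeds `cache_size`, drop the state that received the lowest attention
    weight from that query."""
    K, V = state
    q_t = x_t @ W_q
    K = np.vstack([K, x_t @ W_k])
    V = np.vstack([V, x_t @ W_v])
    scores = q_t @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    o_t = w @ V
    if K.shape[0] > cache_size:
        drop = int(np.argmin(w))            # least-attended state is evicted
        keep = np.arange(K.shape[0]) != drop
        K, V = K[keep], V[keep]
    return o_t, (K, V)
```

Unlike a sliding window, nothing forces old states out by position: a state kept early on can remain in the multi-state indefinitely as long as later queries keep attending to it.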

Experimental Setup

  • Long-range tasks including language modeling, long-range understanding, and text generation were used for evaluation.

Pretrained Transformers Act as Finite MSRNNs

  • TOVA outperforms other policies on language modeling and long-range summarization, and performs well on text generation tasks.

Analysis

  • TOVA preserves recent tokens and some older tokens, shows a clear preference for the very first token, and highlights the importance of tokens such as punctuation and proper nouns.
  • Using TOVA enables a dramatic increase in the inference batch size.

Conclusion

  • The paper concludes that transformer decoder LLMs often behave as finite MSRNNs and introduces TOVA as a simple compression policy that performs well with minimal memory consumption.

Critique

  • The paper’s evaluation framework focuses mainly on the English language, which may not generalize to languages with different characteristics.
  • Evaluating long-text generation is acknowledged to be difficult; it was assessed indirectly using GPT-4, which may not fully capture the quality of the entire text.

Appendix

Model: gpt-3.5-turbo-1106
Date Generated: 2024-02-26
Abstract: http://arxiv.org/abs/2401.06104v1
HTML: https://browse.arxiv.org/html/2401.06104v1
Truncated: False
Word Count: 8490