Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- A framework is presented for using transformer networks as universal computers.
- An input sequence acts as a punchcard, containing instructions and memory.
- Encoder layers can emulate basic computing blocks.
- These building blocks can emulate a small instruction-set computer.
- The transformer can emulate a basic calculator, linear algebra library, and in-context learning algorithms.
- This highlights the versatility of the attention mechanism.
Paper Content
Introduction
- Transformers (TFs) are popular for machine learning tasks and have achieved state-of-the-art results
- TFs can capture higher-order relationships and long-range dependencies
- Language models with billions of parameters can perform in-context learning
- Transformers can simulate Turing Machines
- Transformers can execute higher level programs
- Transformers can be programmed to emulate abstract computation
- Transformers can be used to create algorithms
- Transformers can be used to perform linear regression models
- Transformers can emulate complex algorithms and programs
- Transformers can execute programs written in a single instruction language
- Transformers can emulate a general-purpose computer
- Transformers can emulate basic calculations, numerical linear algebra algorithms, and in-context learning algorithms
- Transformers can be of depth less than 13 layers
Prior work
- Transformer networks have been studied for their expressive power and in-context learning capabilities
- Transformer networks are Turing complete, meaning they can simulate a Turing machine
- Transformers can act as universal sequence to sequence approximators
- A domain-specific language called RASP has been proposed to map the basic components of a TF encoder into simple primitives
- Transformers can serve as universal computers, enabling the implementation of arbitrary nonlinear functions
- Transformers can be trained from scratch to perform in-context learning of linear functions and more complex model classes
- Transformers can perform algorithmic reasoning using fewer layers than the number of reasoning steps
Preliminaries
- Transformer architecture uses embedding vectors instead of tokens
- Input to each layer is a vector representation of a sequence of n tokens
- Output of each layer is a function of the input matrix
- Iterative computation through a simple loop
- Input sequence includes instructions and memory
- Memory serves as a storage location for data
- Commands serve as instructions to guide the internal functioning of the transformer
- Operations are designed to be interoperable with each other
- Operations are building blocks to create more complex routines and algorithms
Positional encodings, program counter, and data pointers
- A transformer can locate the position of each token by appending positional encodings to each column of X.
- The positional encodings are binary representations of the column index.
- The encoding for token/column indexed by i is a log(n)-dimensional ±1 binary vector p i ∈ ±1 log(n).
- A program counter is used to iterate through commands.
- The program counter and data pointers use the same positional encodings.
- A transformer can read data/command vectors from the input to the scratchpad from the location pointed to by the position embedding vector in the scratchpad.
- A transformer can write a data vector stored in the scratchpad to a specific location in the input, as designated by a positional encoding vector in the scratchpad.
- A transformer can evaluate a condition and set the program counter to a specified location if the condition is true, or increment the program counter by 1 if the condition is false.
- A looped transformer architecture can run SUBLEQ programs.
Functions in the unified template form
- Demonstrates how to implement nonlinear functions and linear algebra operations using transformers
- Unified template for input/output parameters’ locations for each transformer-based function block
Encoding non-linear functions within the attention mechanism
- Attention mechanism can be used to encode functions
- Barron [1993] showed that functions can be approximated by linear combinations of sigmoids
- To encode N different functions, coefficients of sigmoids are stored in query and value weight matrices
- Transformer blocks can be used to approximate arbitrary functions
- Corollary 6 shows that one head is enough to encode m terms in the dimension of the transformer architecture
Matrix transposition and multiplication by linearizing the softmax
- Matrix A is represented by a sequence of length d
- Vectorized form of matrix is more suited to create transpose
- Matrix multiplication fits in unified template
- Leverage linearization of softmax
- Error is controlled by constant C
- Appendix A.2 provides exact form of input X and proof of Lemmas
Advantage of attention over fully-connected networks
- It is possible to implement functions and lexicographic functionality using fully connected networks.
- Attention-based networks require less depth than fully connected networks to compute simple functions.
- Constructing attention-based networks is not straightforward.
- FLEQ transformer can be used to build a simple calculator.
- Theorem 4 states that a transformer with 12 layers, m heads and dimensionality O(log n) can implement a calculator.
- Lemma 5 and Corollary 6 provide error guarantees.
- Algorithm 3 shows how to implement a calculator in the FLEQ framework.
- Calculator can be extended to include algebraic and trigonometric functions.
Linear algebra
- Matrix operations can be implemented using transformer-based architecture
- Matrix transpose and matrix multiplication can be implemented as function blocks
- Newton-Raphson Method and Power Iteration can be used to determine inverse of a matrix and eigenvector
- Linear algebra operations can be implemented using a transformer-based architecture
- QR decomposition, Gauss-Seidel, Arnoldi iteration, and Lanczos algorithm can be implemented using transformer construction
Emulating learning algorithms at inference time
- Stochastic Gradient Descent (SGD) can be implemented using a unified template.
- This template can be used to update the implicit weights of a model.
- Previous research has limited in-context learning to a single inference call of a transformer model.
- SGD can be implemented in linear models using Algorithm 9.
- Lemma 14 provides a transformer with 13 layers, 1 head and dimensionality O(log(D) + d) to implement T iterations of SGD on a weight vector w ∈ R d.
- SGD can be implemented on more general loss functions and models beyond linear regression.
- Backpropagation and SGD can be generalized to two layer neural networks with non-linear activation functions.
- The algorithm can be generalized to networks of arbitrary depth, with the caveat that the length of the code will scale with the number of layers in the network.
- The cost of this algorithm is proportional to looping the transformer network as many times as the depth of the network.