Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • A framework is presented for using transformer networks as universal computers.
  • An input sequence acts as a punchcard, containing instructions and memory.
  • Encoder layers can emulate basic computing blocks.
  • These building blocks can emulate a small instruction-set computer.
  • The transformer can emulate a basic calculator, linear algebra library, and in-context learning algorithms.
  • This highlights the versatility of the attention mechanism.

Paper Content

Introduction

  • Transformers (TFs) are popular for machine learning tasks and have achieved state-of-the-art results
  • TFs can capture higher-order relationships and long-range dependencies
  • Language models with billions of parameters can perform in-context learning
  • Transformers can simulate Turing Machines
  • Transformers can execute higher level programs
  • Transformers can be programmed to emulate abstract computation
  • Transformers can be used to create algorithms
  • Transformers can be used to fit linear regression models
  • Transformers can emulate complex algorithms and programs
  • Transformers can execute programs written in a single instruction language
  • Transformers can emulate a general-purpose computer
  • Transformers can emulate basic calculations, numerical linear algebra algorithms, and in-context learning algorithms
  • These constructions need only shallow transformers, with a depth of less than 13 layers

Prior work

  • Transformer networks have been studied for their expressive power and in-context learning capabilities
  • Transformer networks are Turing complete, meaning they can simulate a Turing machine
  • Transformers can act as universal sequence to sequence approximators
  • A domain-specific language called RASP has been proposed to map the basic components of a TF encoder into simple primitives
  • Transformers can serve as universal computers, enabling the implementation of arbitrary nonlinear functions
  • Transformers can be trained from scratch to perform in-context learning of linear functions and more complex model classes
  • Transformers can perform algorithmic reasoning using fewer layers than the number of reasoning steps

Preliminaries

  • The construction operates directly on embedding vectors rather than on discrete tokens
  • The input to each layer is a matrix X whose n columns are the vector representations of a sequence of n tokens
  • The output of each layer is a function of the input matrix
  • The output is fed back in as the next input, giving iterative computation through a simple loop
  • Input sequence includes instructions and memory
  • Memory serves as a storage location for data
  • Commands serve as instructions to guide the internal functioning of the transformer
  • Operations are designed to be interoperable with each other
  • Operations are building blocks to create more complex routines and algorithms
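
The loop described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names (`layer`, `looped_transformer`), the single attention head, the ReLU feed-forward sublayer, and the one-column-per-token layout are choices made for this example, and the weights are placeholders rather than the matrices constructed in the paper.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def softmax_columns(S):
    """Apply a numerically stable softmax to every column of S."""
    columns = []
    for col in zip(*S):
        top = max(col)
        exps = [math.exp(s - top) for s in col]
        total = sum(exps)
        columns.append([e / total for e in exps])
    return transpose(columns)

def layer(X, Q, K, V, W1, W2):
    """One encoder layer on X (one column per token), with residual connections."""
    scores = matmul(transpose(matmul(K, X)), matmul(Q, X))  # n x n attention scores
    X = add(X, matmul(matmul(V, X), softmax_columns(scores)))
    hidden = [[max(0.0, h) for h in row] for row in matmul(W1, X)]  # ReLU
    return add(X, matmul(W2, hidden))

def looped_transformer(X, weights, num_loops):
    """Feed the output of the same layer back in as its next input."""
    for _ in range(num_loops):
        X = layer(X, *weights)
    return X

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
X = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
# With Q = K = 0 every column attends uniformly, so V = I adds the mean column.
out = looped_transformer(X, (zeros, zeros, identity, zeros, zeros), num_loops=1)
```

Because the output has the same shape as the input, the same weights can be applied any number of times, which is what lets a fixed-depth network carry out a computation with many steps.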

Positional encodings, program counter, and data pointers

  • A transformer can locate the position of each token by appending positional encodings to each column of X.
  • The positional encodings are binary representations of the column index.
  • The encoding for the token/column indexed by i is a log(n)-dimensional ±1 binary vector $p_i \in \{\pm 1\}^{\log(n)}$.
  • A program counter is used to iterate through commands.
  • The program counter and data pointers use the same positional encodings.
  • A transformer can read a data or command vector from the input into the scratchpad, taking it from the location designated by a positional encoding vector held in the scratchpad.
  • A transformer can write a data vector stored in the scratchpad to a specific location in the input, as designated by a positional encoding vector in the scratchpad.
  • A transformer can evaluate a condition and set the program counter to a specified location if the condition is true, or increment the program counter by 1 if the condition is false.
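  • A looped transformer architecture can run SUBLEQ programs; SUBLEQ (subtract and branch if less than or equal to zero) is a single instruction that suffices for general-purpose computation.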
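
The read, conditional-branch, and program-counter steps above can be sketched as a small interpreter. This is a hedged illustration in ordinary Python, not the paper's weight-level construction: the function names, the `temperature` value, and the convention that an instruction `(a, b, c)` performs `mem[b] -= mem[a]` and jumps to `c` when the result is at most zero are assumptions made for this example. Memory reads go through a softmax over inner products of ±1 position codes, which peaks at the column whose code matches the pointer.

```python
import math

def position_code(i, bits):
    """±1 binary code of index i: bit k of i mapped from {0, 1} to {-1, +1}."""
    return [2 * ((i >> k) & 1) - 1 for k in range(bits)]

def attention_read(values, codes, pointer, temperature=50.0):
    """Softmax-attention lookup of the value whose position code matches `pointer`."""
    scores = [temperature * sum(p * q for p, q in zip(code, pointer)) for code in codes]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def run_subleq(program, memory, max_steps=1000):
    """Run (a, b, c) instructions: mem[b] -= mem[a]; jump to c if mem[b] <= 0."""
    memory = list(memory)
    bits = max(1, math.ceil(math.log2(len(memory))))
    codes = [position_code(i, bits) for i in range(len(memory))]
    pc = 0  # program counter
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break  # halting convention for this sketch: the counter leaves the program
        a, b, c = program[pc]
        val_a = round(attention_read(memory, codes, codes[a]))
        val_b = round(attention_read(memory, codes, codes[b]))
        memory[b] = val_b - val_a
        pc = c if memory[b] <= 0 else pc + 1
    return memory

# Add mem[0] into mem[1], using mem[2] (initially 0) as scratch space.
result = run_subleq([(0, 2, 1), (2, 1, 2), (2, 2, 3)], [3, 4, 0])
```

Two identical ±1 codes have inner product equal to the number of bits, while any two different codes score at least 2 lower, so a large enough temperature makes the softmax behave like an exact lookup.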

Functions in the unified template form

  • Demonstrates how to implement nonlinear functions and linear algebra operations using transformers
  • Unified template for input/output parameters’ locations for each transformer-based function block

Encoding non-linear functions within the attention mechanism

  • Attention mechanism can be used to encode functions
  • Barron [1993] showed that functions can be approximated by linear combinations of sigmoids
  • To encode N different functions, coefficients of sigmoids are stored in query and value weight matrices
  • Transformer blocks can be used to approximate arbitrary functions
  • Corollary 6 shows that a single head is enough if the m sigmoid terms are instead encoded in the dimension of the transformer architecture
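
To make the sigmoid idea concrete, here is a minimal sketch under stated assumptions. It relies on the identity that a softmax over the two scores (z, 0) puts weight sigmoid(z) on the first score, which is one way an attention layer can produce sigmoids; whether the paper uses exactly this arrangement is not stated in this summary. The function `bump` and its coefficients are hypothetical, chosen by hand to approximate the indicator of [-1, 1] with two sigmoid terms.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax_first_weight(z):
    """Weight that a softmax over the two scores (z, 0) assigns to the first one."""
    top = max(z, 0.0)
    e_z, e_0 = math.exp(z - top), math.exp(0.0 - top)
    return e_z / (e_z + e_0)

def bump(x, sharpness=20.0):
    """Approximate the indicator of [-1, 1] as a linear combination of two sigmoids."""
    coefficients = [1.0, -1.0]          # output weights c_i
    slopes = [sharpness, sharpness]     # inner weights a_i
    offsets = [sharpness, -sharpness]   # inner biases b_i
    return sum(
        c * softmax_first_weight(a * x + b)
        for c, a, b in zip(coefficients, slopes, offsets)
    )

inside, outside = bump(0.0), bump(3.0)
```

With more terms and fitted coefficients, the same weighted sum can approximate a much wider class of functions, which is the role the stored coefficients play in the bullets above.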

Matrix transposition and multiplication by linearizing the softmax

  • A matrix A is represented by a sequence of length d
  • The vectorized form of a matrix is better suited to computing its transpose
  • Matrix multiplication fits in the unified template
  • The construction leverages a linearization of the softmax
  • The approximation error is controlled by a constant C
  • Appendix A.2 provides the exact form of the input X and the proofs of the lemmas
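
The role of the constant C can be illustrated with the first-order expansion exp(x / C) ≈ 1 + x / C, which is the kind of approximation that makes a softmax behave linearly. The sketch below is an assumption about how such a constant enters, not the exact construction from Appendix A.2: it recovers each entry of a matrix product from its exponential and shows the error shrinking as C grows. The names `linearized_product` and `max_error` are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def linearized_product(A, B, C):
    """Recover A @ B from exp((A @ B) / C) - 1, using exp(x / C) - 1 ≈ x / C."""
    return [[C * math.expm1(x / C) for x in row] for row in matmul(A, B)]

def max_error(A, B, C):
    """Largest entrywise gap between the linearized product and the exact one."""
    exact = matmul(A, B)
    approx = linearized_product(A, B, C)
    return max(abs(p - q) for rp, rq in zip(approx, exact) for p, q in zip(rp, rq))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
coarse = max_error(A, B, C=1e2)  # small C: visible second-order error
fine = max_error(A, B, C=1e4)    # large C: error shrinks roughly like x**2 / (2 * C)
```

The leading error term is x² / (2C) per entry, so the required C depends on how large the entries of the product can be.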

Advantage of attention over fully-connected networks

  • It is possible to implement functions and lexicographic functionality using fully connected networks.
  • Attention-based networks require less depth than fully connected networks to compute simple functions.
  • Constructing attention-based networks is not straightforward.
  • A FLEQ transformer can be used to build a simple calculator.
  • Theorem 4 states that a transformer with 12 layers, m heads and dimensionality O(log n) can implement a calculator.
  • Lemma 5 and Corollary 6 provide error guarantees.
  • Algorithm 3 shows how to implement a calculator in the FLEQ framework.
  • The calculator can be extended to include algebraic and trigonometric functions.

Linear algebra

  • Matrix and other linear algebra operations can be implemented using a transformer-based architecture
  • Matrix transpose and matrix multiplication can be implemented as function blocks
  • The Newton-Raphson method can be used to approximate the inverse of a matrix, and power iteration to approximate its dominant eigenvector
  • QR decomposition, Gauss-Seidel, Arnoldi iteration, and the Lanczos algorithm can be implemented using the transformer construction
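
The two iterative routines named above use only transposes, products, and sums, which is why they fit the function blocks already described. The following is a plain-Python sketch of the underlying numerical methods, not of the transformer implementation: the Newton update X ← X(2I − AX), the starting point Aᵀ / (‖A‖₁‖A‖∞), the iteration counts, and the function names are standard or illustrative choices assumed for this example.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def newton_inverse(A, iterations=30):
    """Approximate the inverse of A with the Newton iteration X <- X (2I - A X)."""
    n = len(A)
    norm_1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in A)
    # Start from a scaled transpose, which lies inside the convergence region.
    X = [[A[j][i] / (norm_1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        AX = matmul(A, X)
        correction = [[(2.0 if i == j else 0.0) - AX[i][j] for j in range(n)]
                      for i in range(n)]
        X = matmul(X, correction)
    return X

def power_iteration(A, iterations=100):
    """Approximate the dominant eigenvalue and eigenvector of A."""
    v = [1.0] * len(A)
    for _ in range(iterations):
        w = [sum(a * x for a, x in zip(row, v)) for row in A]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    eigenvalue = sum(x * y for x, y in zip(v, Av))  # Rayleigh quotient, v has unit norm
    return eigenvalue, v

A = [[2.0, 1.0], [1.0, 3.0]]
A_inv = newton_inverse(A)
top_eigenvalue, top_vector = power_iteration(A)
```

Each pass of either loop is a fixed sequence of matrix operations, so one pass corresponds naturally to a bounded number of instructions that the looped network repeats.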

Emulating learning algorithms at inference time

  • Stochastic Gradient Descent (SGD) can be implemented using a unified template.
  • This template can be used to update the implicit weights of a model.
  • Previous research has limited in-context learning to a single inference call of a transformer model.
  • SGD can be implemented in linear models using Algorithm 9.
  • Lemma 14 provides a transformer with 13 layers, 1 head and dimensionality O(log(D) + d) to implement T iterations of SGD on a weight vector $w \in \mathbb{R}^d$.
  • SGD can be implemented on more general loss functions and models beyond linear regression.
  • Backpropagation and SGD can be generalized to two layer neural networks with non-linear activation functions.
  • The algorithm can be generalized to networks of arbitrary depth, with the caveat that the length of the code will scale with the number of layers in the network.
  • The cost of this algorithm grows with the depth of the emulated network, since the transformer must be looped in proportion to the number of layers in that network.
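
For reference, the update that the linear-model case emulates can be written directly. This is a minimal sketch of SGD on a squared loss, assuming a fixed learning rate and a deterministic cycle through the D data points instead of random sampling; it is not a transcription of Algorithm 9, and the function name and toy data are made up for illustration.

```python
def sgd_linear_regression(xs, ys, learning_rate=0.1, num_steps=500):
    """Run num_steps single-sample gradient steps on the loss 0.5 * (w . x - y)**2."""
    w = [0.0] * len(xs[0])
    for t in range(num_steps):
        # Cycle through the D data points (a deterministic stand-in for sampling).
        x, y = xs[t % len(xs)], ys[t % len(xs)]
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        # The gradient of the per-sample loss with respect to w is residual * x.
        w = [wi - learning_rate * residual * xi for wi, xi in zip(w, x)]
    return w

# Noiseless toy data generated from the weight vector (2, -1).
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
ys = [2.0, -1.0, 1.0, 3.0]
w = sgd_linear_regression(xs, ys)
```

Each step needs one inner product, one subtraction, and one scaled vector update, so T iterations map onto T passes through the same short sequence of operations.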