Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Many tasks can be broken down into subroutines.
  • Neural networks can achieve impressive performance on vision and language tasks, but it is not known how they do this.
  • One possibility is that neural networks break down tasks into subroutines and compose them into an overall solution.
  • Model pruning techniques are used to investigate this question in vision and language tasks.
  • Results suggest that neural networks can learn to exhibit compositionality.

Paper Content

Introduction

  • Neural networks have come to dominate AI, but much is unknown about the functions they learn
  • Debate over role of compositionality
  • Compositionality is a key property of human cognition
  • Representation system is compositional if it implements discrete constituent functions
  • Open question whether neural networks require explicit symbolic mechanisms to implement compositional solutions
  • Historically, neural networks have been considered non-compositional
  • Modern neural networks have demonstrated successes on complex tasks
  • Introduce concept of structural compositionality
  • Introduce technique to test for structural compositionality
  • Test a range of models on compositional problems framed as odd-one-out tasks
  • Non-compositional solution stores prototype that encodes conjunction of two subroutines
  • Compositional solution computes subroutines in modular subnetworks

Structural compositionality

  • Prior work on compositionality in neural networks has mostly yielded negative results.
  • Definitions of compositionality are based on a system’s representations, not its behavior.
  • This paper focuses on evaluating the extent to which a model’s representations are structured compositionally.
  • There are two ways a network might learn to solve a compositional task.
  • If a model exhibits structural compositionality, it should be possible to find a subnetwork that implements one subroutine and not the other.

Experimental design

Preliminaries

  • Subroutine: A binary rule
  • Compositional Rule: A binary rule that maps input to output using subroutines
  • Base Model: A model trained to solve a task defined by a compositional rule
  • Subnetwork: A subset of the parameters of a base model, implementing one subroutine
  • Ablated Model: A model after ablating a subnetwork
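
The definitions above can be illustrated with a minimal Python sketch. It assumes that the compositional rule is the conjunction of its two subroutines and that an odd-one-out example contains exactly one rule-violating input. The `Image` class, its fields, and all function names are hypothetical and chosen only for illustration; a real input would be an image, not a pair of precomputed flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    # Hypothetical stand-in for an image: both properties are given directly.
    inside: bool   # one shape lies inside the other
    contact: bool  # the two shapes touch


def inside_rule(img: Image) -> bool:
    """Subroutine: a binary rule applied to a single input."""
    return img.inside


def contact_rule(img: Image) -> bool:
    """Subroutine: a second binary rule."""
    return img.contact


def inside_contact_rule(img: Image) -> bool:
    """Compositional rule, assumed here to be the conjunction of both subroutines."""
    return inside_rule(img) and contact_rule(img)


def odd_one_out(images: list[Image]) -> int:
    """Return the index of the single image that violates the compositional rule."""
    violators = [i for i, img in enumerate(images) if not inside_contact_rule(img)]
    if len(violators) != 1:
        raise ValueError("expected exactly one rule-violating image")
    return violators[0]


# Three rule-following images and one (-Inside, +Contact) image.
example = [Image(True, True), Image(True, True), Image(False, True), Image(True, True)]
print(odd_one_out(example))  # prints 2
```

In this toy example the odd image differs only in the Inside property, so a component that computes Inside alone would be enough to pick it out.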

Experimental logic

  • Consider a compositional rule, C, with two subroutines
  • Train a base model, M_C, to solve an odd-one-out task
  • Characterize the extent to which M_C exhibits structural compositionality
  • Learn a binary mask over the weights of M_C for each subroutine
  • Evaluate the subnetwork on two partitions of the training set
  • Ablate the subnetwork from the base model and observe the behavior
  • Determine modularity by comparing performance on the two partitions
  • Expect a positive difference in performance if model exhibits structural compositionality
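
The comparison in the last two steps can be sketched as follows. This is only an illustration of the logic, not the paper's evaluation code: the names `PartitionAccuracy`, `subnetwork_gap`, `ablation_gap`, and `looks_modular` are hypothetical, the `margin` threshold is an assumption, and the accuracies in the example are made-up numbers, not reported results.

```python
from typing import NamedTuple


class PartitionAccuracy(NamedTuple):
    target: float  # accuracy on the partition that depends on the target subroutine
    other: float   # accuracy on the partition that depends on the other subroutine


def subnetwork_gap(acc: PartitionAccuracy) -> float:
    """A modular subnetwork should do better on its own subroutine."""
    return acc.target - acc.other


def ablation_gap(acc: PartitionAccuracy) -> float:
    """After ablating that subnetwork, the pattern should reverse."""
    return acc.other - acc.target


def looks_modular(sub: PartitionAccuracy, abl: PartitionAccuracy, margin: float = 0.0) -> bool:
    """Both differences are expected to be positive under structural compositionality."""
    return subnetwork_gap(sub) > margin and ablation_gap(abl) > margin


# Made-up accuracies, for illustration only.
sub = PartitionAccuracy(target=0.95, other=0.30)
abl = PartitionAccuracy(target=0.28, other=0.90)
print(looks_modular(sub, abl))
```

Checking both the subnetwork and the ablated model matters: a positive gap for the subnetwork alone does not show that the base model actually relies on it.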

Discovering subnetworks

  • Use continuous sparsification to discover subnetworks within models
  • Optimize a deterministic binary mask over network weights to produce a subnetwork that solves a task
  • Employ L0 regularization to encourage sparsity in the binary mask
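
A minimal, dependency-free sketch of the mask parameterization may help here. It assumes the usual continuous sparsification setup: each weight has a learnable score `s`, the soft mask is a sigmoid of `beta * s` with a temperature `beta` that grows during training, the final binary mask keeps weights with positive scores, and the L0 penalty is replaced by the sum of soft mask values. All names and toy values are hypothetical, and the gradient-based training loop is omitted.

```python
import math


def soft_mask(scores, beta):
    """Sigmoid relaxation of the mask; larger beta pushes values toward 0 or 1."""
    return [1.0 / (1.0 + math.exp(-beta * s)) for s in scores]


def hard_mask(scores):
    """Deterministic binary mask: keep a weight only if its score is positive."""
    return [1.0 if s > 0 else 0.0 for s in scores]


def sparsity_penalty(scores, beta, lam):
    """Differentiable surrogate for the L0 penalty: lam times the sum of soft mask values."""
    return lam * sum(soft_mask(scores, beta))


def apply_mask(weights, mask):
    """Elementwise product of weights and mask."""
    return [w * m for w, m in zip(weights, mask)]


weights = [0.8, -1.2, 0.5, 2.0]  # toy stand-in for base-model weights
scores = [1.5, -0.7, -2.0, 0.3]  # toy mask parameters (the quantities being optimized)

for beta in (1.0, 10.0, 100.0):
    # The soft mask approaches the hard mask as beta increases.
    print(beta, [round(m, 3) for m in soft_mask(scores, beta)])

print(apply_mask(weights, hard_mask(scores)))
```

The penalty shrinks as more scores become negative, which is what pushes the discovered subnetwork toward sparsity.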

Vision experiments

  • Extended collection of datasets introduced in Zerroug et al. (2022)
  • Three basic subroutines: contact, inside, and number
  • Three compositional rules: Inside-Contact, Number-Contact, and Inside-Number
  • Four types of images, each containing two shapes
  • Odd-one-out task in the vision domain defined by a compositional rule
  • Base model trained to solve odd-one-out task
  • Subnetwork discovered to implement each subroutine
  • Base model trained with cross-entropy loss
  • Mask training with L0 regularization
  • Three backbone architectures: Resnet50, Wide Resnet50, and ViT
  • Hyperparameter search over batch size and learning rate
  • Continuous sparsification parameters for each subroutine
  • Evaluate on Test Target Subroutine and Test Other Subroutine
  • M_ablate_i = M_C − Sub_i, evaluated on Test Target Subroutine and Test Other Subroutine
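
The relation between the base model, a subnetwork, and the ablated model can be sketched on a flat list of weights. This assumes that ablation means zeroing exactly the weights selected by the binary mask for subroutine i, which is one reading of the subtraction in the last bullet. The function name and toy values are hypothetical.

```python
def split_by_mask(weights, mask):
    """Return (subnetwork, ablated) weight lists for a binary mask.

    The subnetwork keeps only the masked-in weights; the ablated model keeps
    only the weights outside the mask (assumed interpretation of ablation).
    """
    subnetwork = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    ablated = [0.0 if keep else w for w, keep in zip(weights, mask)]
    return subnetwork, ablated


base_weights = [0.8, -1.2, 0.5, 2.0]  # toy stand-in for the weights of the base model
mask_i = [1, 0, 0, 1]                 # toy binary mask defining the subnetwork

sub_i, m_ablate_i = split_by_mask(base_weights, mask_i)
print(sub_i)       # [0.8, 0.0, 0.0, 2.0]
print(m_ablate_i)  # [0.0, -1.2, 0.5, 0.0]
```

Under this reading the subnetwork and the ablated model partition the base weights, so adding them elementwise recovers the base model.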

Language experiments

  • We use a subset of data from Marvin & Linzen (2019) to construct odd-one-out tasks for language data.
  • We construct rules based on two forms of syntactic agreement: Subject-Verb Agreement and Reflexive Anaphora agreement.
  • We partition the Subject-Verb Agreement dataset into two subsets, one that targets singular sentences and one that targets plural sentences.
  • We study one architecture, BERT-Small, which is a BERT architecture with 4 hidden layers.

Results

  • Structural compositionality is favored for some architecture/task combinations
  • Subnetwork performance is higher on the subroutine it was trained to implement
  • Ablated model performance is lower on the subroutine that the ablated subnetwork was trained to implement and higher on the other subroutine

Effect of pretraining on structural compositionality

  • Finetuning pretrained models improves performance
  • Pretrained weights are used for Resnet50 and BERT-Small
  • Experiments show that pretraining more consistently yields modular subnetworks
  • Pretraining makes the subnetwork discovery algorithm more stable

Related work

  • Most prior work has focused on compositional generalization of standard neural models
  • Some prior work has attempted to induce an inductive bias toward compositional generalization from data
  • Prior efforts seek to attribute causality to specific components of neural networks’ internal representations
  • Present study does not require assumptions about where in the network the subroutine is implemented
  • Present study stands adjacent to network pruning research
  • Mechanistic interpretability aims to reverse engineer neural networks
  • Cammarata et al. (2020) manually inspect individual neurons in InceptionV1
  • Present study introduces a method that automatically discovers subnetworks
  • Subroutines consist of higher-level features than those in previous work

Discussion

  • Neural networks can often decompose tasks into subtasks and implement solutions in modular subnetworks
  • Self-supervised pretraining leads to more consistent structural compositionality in models finetuned on compositional tasks

B. Continuous sparsification: extended discussion

  • Continuous sparsification attempts to optimize a binary mask that minimizes a loss function.
  • The objective combines a standard task loss with an L0 penalty on the mask.
  • Optimizing the binary mask directly is intractable, so a continuous relaxation of the mask is optimized instead.
  • During training, a soft mask (σ) and a discrete mask (H) are interpolated.
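
Written out, the objective described above might look like the following. This is a sketch under assumptions rather than the paper's exact formulation: the symbols f (the network), w (its weights), s (the mask parameters), β (a temperature that increases during training), λ (the penalty weight), and d (the number of weights) are notation introduced here, and the L1 norm of the soft mask stands in for the L0 penalty.

```latex
% Intractable discrete problem: choose a binary mask m over the weights w
\min_{m \in \{0,1\}^d} \; \mathcal{L}\big(f(x;\, w \odot m)\big) + \lambda \lVert m \rVert_0

% Continuous relaxation: replace m with a soft mask \sigma(\beta s)
\min_{s \in \mathbb{R}^d} \; \mathcal{L}\big(f(x;\, w \odot \sigma(\beta s))\big) + \lambda \lVert \sigma(\beta s) \rVert_1

% As the temperature grows, the soft mask approaches the discrete mask (for s_j \neq 0)
\lim_{\beta \to \infty} \sigma(\beta s_j) = H(s_j)
```

The limit in the last line is what connects the soft mask σ used during training to the discrete mask H.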

C. Mask hyperparameter search details

  • Searched over learning rates, mask parameter initializations, and mask configurations
  • Mask configurations based on stages for Resnet models and layers for transformer models
  • Best hyperparameter configuration determined based on accuracy and structural compositionality
  • Computationally expensive process due to training many separate masks

D. ViT hyperparameter search results

  • Conducted hyperparameter search on ViT models
  • Searched over batch sizes and learning rates for 6- and 12-layer ViT models
  • MLP had a hidden layer of dimensionality 2048 and output dimensionality of 128
  • All models failed to solve any of the tasks
  • +/-Number subroutine used for training/test examples
  • All datasets had training set size of 10000, validation set size of 500, and test set size of 1000

F. Language data details

  • Generated language data using templates provided by Marvin & Linzen (2019)
  • Omitted templates that position noun of interest inside sentential complement or object relative clause
  • Nouns of interest always second word of sentence
  • Split up datasets into singular and plural partitions
  • Dataset statistics identical for singular and plural instances
  • Goal is to find subnetworks that implement subroutines
  • Avoid potential complication of pretrained model needing to unlearn grammaticality computation
  • Experiment to control for randomly-initialized models
  • Results indicate that subnetworks are not causally implicated in model behavior
  • Ablating subnetworks collapses performance to chance for all tasks
  • Ablating subnetworks oftentimes yields high performance on Test Target Subroutine and low performance on Test Other Subroutine
  • Results reflect internal mechanisms of trained models, not epiphenomenal artifacts of training binary masks over networks