Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Investigating how to use generative AI for neural decoding
  • Two-stage scene reconstruction framework called “Brain-Diffuser”
  • First stage reconstructs images with low-level properties and overall layout
  • Second stage uses image-to-image framework of latent diffusion model
  • Outperforms previous models on Natural Scenes Dataset benchmark
  • Creates “ROI-optimal” scenes consistent with neuroscientific knowledge

Paper Content

Introduction

  • Establishing neural encoding and decoding techniques is one way to discover how the brain and cognition work
  • Recent developments in modeling and computation have opened up new ways of decoding information from brain signals
  • Studies have used statistical techniques and machine learning to decode information from fMRI
  • Deep generative models have been used to reconstruct entire images
  • Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Latent Diffusion Models (LDM) have been used
  • Images with different levels of complexity have been reconstructed, such as faces, single-object-centered images, and complex scenes
  • Two datasets have been used for natural scene reconstruction: Generic Object Decoding and Deep Image Reconstruction
  • Brain-Diffuser model proposed to generate scene images with high fidelity
  • Brain-Diffuser model uses two stages and is conditioned on both vision and language representations
  • Brain-Diffuser model demonstrates superior performance compared to earlier models

Materials and methods

Dataset

  • Used publicly available Natural Scenes Dataset (NSD)
  • Collected from 8 subjects viewing images from COCO dataset
  • Used 4 subjects who completed all trials
  • Training set contained 8859 images and 24980 fMRI trials
  • Test set contained 982 images and 2770 fMRI trials
  • Averaged fMRI trials for images with multiple repetitions
  • Used corresponding captions from COCO dataset
  • Used preprocessed single-trial beta weights estimated with generalized linear models
  • Masked preprocessed fMRI signals using NSDGeneral ROI mask
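
The masking and trial-averaging steps above can be sketched as follows. This is a minimal illustration on toy arrays, not the authors' preprocessing code: the function names, array shapes, and the boolean mask are assumptions, and the real NSDGeneral mask and beta files are not loaded here.

```python
import numpy as np

def apply_roi_mask(betas, roi_mask):
    """Keep only the voxels flagged True in a boolean ROI mask (hypothetical helper)."""
    return betas[:, np.asarray(roi_mask, dtype=bool)]

def average_repeats(betas, image_ids):
    """Average single-trial patterns that share an image id.

    betas:     (n_trials, n_voxels) array of single-trial beta weights
    image_ids: (n_trials,) array giving the image shown on each trial
    Returns the sorted unique image ids and one averaged pattern per image.
    """
    unique_ids, inverse = np.unique(image_ids, return_inverse=True)
    sums = np.zeros((len(unique_ids), betas.shape[1]))
    np.add.at(sums, inverse, betas)                # accumulate trials per image
    counts = np.bincount(inverse, minlength=len(unique_ids))
    return unique_ids, sums / counts[:, None]

# Toy data: three trials, three voxels; image 7 was shown twice.
betas = np.array([[1.0, 2.0, 3.0],
                  [3.0, 4.0, 5.0],
                  [10.0, 20.0, 30.0]])
image_ids = np.array([7, 7, 3])
roi_mask = np.array([True, False, True])           # stand-in for an ROI mask

masked = apply_roi_mask(betas, roi_mask)
ids, patterns = average_repeats(masked, image_ids)
```

Here `patterns` holds one row per unique image, so images with several repetitions contribute a single averaged fMRI pattern.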

Low-level reconstruction of images using VDVAE (first stage)

  • VAE is a generative model used to capture an input distribution
  • VDVAE is a hierarchical VAE model with several layers of conditionally dependent latent variables
  • Equations 1 and 2 show the hierarchical dependence of the latent variables
  • VDVAE model has 75 layers and is pretrained on 64x64 resolution ImageNet images
  • Latent variables from the first 31 layers are used for regression
  • Ridge regression model is trained between fMRI training patterns and concatenated latent variables
  • Test fMRI patterns are provided to the trained regression model to predict latent values
  • Latent values are fed to the decoder part of the VDVAE to obtain reconstructed images
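
The regression step above can be sketched with closed-form ridge regression. This is an illustration under stated assumptions: the data are synthetic, the dimensions and the `alpha` value are arbitrary, and `fit_ridge` and `predict` are hypothetical helper names. The VDVAE decoder that would consume the predicted latents is not shown.

```python
import numpy as np

def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression with an unpenalized intercept.

    X: (n_samples, n_voxels) fMRI patterns
    Y: (n_samples, n_latent) concatenated latent variables
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)           # (n_voxels, n_latent)
    b = y_mean - x_mean @ W
    return W, b

def predict(X, W, b):
    """Map fMRI patterns to predicted latent values."""
    return X @ W + b

# Synthetic stand-ins for fMRI patterns and latent variables (assumed sizes).
rng = np.random.default_rng(0)
n_train, n_test, n_voxels, n_latent = 200, 20, 50, 8
true_W = rng.normal(size=(n_voxels, n_latent))
fmri_train = rng.normal(size=(n_train, n_voxels))
fmri_test = rng.normal(size=(n_test, n_voxels))
latents_train = fmri_train @ true_W + 0.1 * rng.normal(size=(n_train, n_latent))
latents_test = fmri_test @ true_W

W, b = fit_ridge(fmri_train, latents_train, alpha=1.0)
latents_pred = predict(fmri_test, W, b)            # would be passed to the decoder
```

With real data the number of voxels and latent dimensions is far larger than in this toy case, so a library solver or the dual form of ridge regression would likely be a better fit than forming the voxel-by-voxel Gram matrix.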

Final reconstruction of images using versatile diffusion (second stage)

  • Used VDVAE to reconstruct image layout
  • Used Versatile Diffusion model in second stage of reconstruction framework
  • Versatile Diffusion is a latent diffusion model
  • Autoencoder trained on large-scale image dataset to learn compressed representation of images
  • Forward diffusion process applied to latent variables by adding Gaussian noise
  • Reverse diffusion process learned via neural network to predict and remove noise
  • Versatile Diffusion model allows for conditioning on text captions, images, and semantic maps
  • Versatile Diffusion model trained on LAION2B-en dataset
  • CLIP network based on transformer architecture
  • Two regression models trained between fMRI patterns and CLIP-Vision/Text features
  • Image-to-image pipeline of latent diffusion model used at testing time
  • CLIP-Vision and CLIP-Text used jointly in double-guided diffusion pipeline
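
The forward diffusion step that starts an image-to-image pipeline can be sketched as below. In such a pipeline the latent of the first-stage image is noised up to an intermediate step and then denoised under the CLIP conditioning. The linear noise schedule, the `strength` value, the latent shape, and all function names are assumptions for illustration; the denoising network and the CLIP guidance are not implemented here.

```python
import numpy as np

def make_alpha_bar(n_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear noise schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, t, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(a_t) * z_0 + sqrt(1 - a_t) * eps."""
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def estimate_clean_latent(zt, eps_hat, t, alpha_bar):
    """Invert the forward step given a noise estimate eps_hat."""
    return (zt - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
n_steps = 1000
alpha_bar = make_alpha_bar(n_steps)

z0 = rng.normal(size=(4, 8, 8))        # stand-in for the encoded first-stage image
strength = 0.75                        # hypothetical fraction of the schedule to apply
t_start = int(strength * n_steps) - 1

zt, eps = add_noise(z0, t_start, alpha_bar, rng)
# A trained network would predict eps from zt and the conditioning; here the
# true noise is reused only to check that the algebra inverts correctly.
z0_rec = estimate_clean_latent(zt, eps, t_start, alpha_bar)
```

A lower `strength` keeps more of the first-stage layout, while a higher one gives the conditioning more freedom to change the image.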

Code availability

  • Code for project is publicly available

Results and analyses

  • Examples of reconstructions in Figure 3
  • Results from individual subjects and average of all subjects
  • Reconstructed images capture layout and semantics of ground-truth images
  • Differences in pixel-level details remain
  • Reconstructed images are naturalistic and plausible alternate renditions of ground truth

Comparison with state of the art

  • Lin et al. used StyleGAN2 model and image/text features
  • Lin et al. performed better than the other two models in some instances
  • Takagi et al. used a latent diffusion model, but their reconstructions are not as good as those of our model
  • Gu et al. used Instance-Conditioned GAN model trained on ImageNet
  • Our model better at reconstructing shape and texture details
  • Our model better at reconstructing complex scenes with multiple objects

Quantitative results

  • 8 different image quality metrics are used to compare models
  • 4 of the metrics are low-level, 4 are high-level
  • Images are preprocessed according to the input properties of each network
  • Our model is the best-performing model on all of the quantitative metrics, by a clear margin
  • Ablations of the main model show that each component provides reliable improvements
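
This summary does not name the eight metrics, so the sketch below is only an assumption about what a low-level and a high-level measure can look like: a pixel-wise correlation between images, and a two-way identification score computed on feature vectors from some pretrained network. The feature arrays are random stand-ins and both function names are hypothetical.

```python
import numpy as np

def pixel_correlation(img_a, img_b):
    """Pearson correlation between two images after flattening (low-level)."""
    a = np.asarray(img_a, dtype=float).ravel()
    b = np.asarray(img_b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])

def two_way_identification(true_feats, recon_feats):
    """Fraction of pairwise comparisons won by the correct ground truth.

    For each reconstruction i and each other image j, count a win when the
    features of reconstruction i correlate more with ground truth i than
    with ground truth j. Chance level is 0.5.
    """
    n = len(true_feats)
    # corr[i, j]: correlation between reconstruction i and ground truth j
    corr = np.corrcoef(recon_feats, true_feats)[:n, n:]
    own = np.diag(corr)[:, None]
    wins = (own > corr).sum()          # diagonal entries never count as wins
    return wins / (n * (n - 1))

rng = np.random.default_rng(0)
true_feats = rng.normal(size=(10, 64))                 # stand-in network features
recon_feats = true_feats + 0.1 * rng.normal(size=(10, 64))
score = two_way_identification(true_feats, recon_feats)

image = rng.random(size=(16, 16))
same = pixel_correlation(image, image)
inverted = pixel_correlation(image, 1.0 - image)
```

Preprocessing each image to the input size and normalization of the feature network, as noted in the list above, would happen before the features are extracted.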

ROI analysis

  • Method can be used to understand functional properties of brain regions
  • Early studies in neuroscience literature identified visual properties that activate neurons in each brain region
  • Method can be adapted to visualize “ROI-optimal” images
  • Synthetic fMRI patterns created by setting voxels of a specific ROI to ones and the rest to zeros
  • Generated images confirm decades of knowledge from neuroscience literature
  • Technique can be extended to study retinotopic or eccentricity-based cortical organization
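
The synthetic-pattern idea above can be sketched as follows: build a pattern that is one inside the chosen ROI and zero elsewhere, then pass it through an already trained fMRI-to-feature regression. The ROI labels, the random weights `W` and `b`, and the helper name are assumptions for illustration; in practice the weights would come from the trained regression models and the output would be fed to the generative stages.

```python
import numpy as np

def roi_pattern(roi_labels, roi_name):
    """Synthetic fMRI pattern: 1.0 for voxels in the chosen ROI, 0.0 elsewhere."""
    return (np.asarray(roi_labels) == roi_name).astype(float)

# Hypothetical per-voxel ROI labels for five voxels.
roi_labels = np.array(["V1", "V1", "face", "place", "face"])

# Random stand-ins for trained regression weights (voxels -> features).
rng = np.random.default_rng(0)
W = rng.normal(size=(len(roi_labels), 4))
b = rng.normal(size=4)

pattern = roi_pattern(roi_labels, "face")
features = pattern @ W + b       # predicted features for the ROI-optimal image
```

Because the pattern is binary, the predicted features are the sum of the weight rows of the selected voxels plus the intercept, which makes the contribution of each ROI easy to inspect.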

Discussion

  • Designed two-stage framework (Brain-Diffuser) to reconstruct images from fMRI patterns
  • Used VDVAE model to generate “initial guess” reconstructions
  • Used image-to-image pipeline of Versatile Diffusion model to generate final reconstructions
  • Results showed reconstructed images preserve most of layout and semantic information
  • Results showed Brain-Diffuser outperforms previous models in both high-level and low-level metrics
  • Future work could design novel experiments and analyses on NSD dataset using generative models

Supplementary

  • VDVAE model provides best results for low-level measures, worst for high-level measures
  • Brain-Diffuser without VDVAE performs worst on low-level measures, best on high-level measures
  • Full Brain-Diffuser model is best for both low-level and high-level measures
  • Brain-Diffuser without CLIP-Text has a sizeable decrement in both low-level and high-level measures
  • Brain-Diffuser without CLIP-Vision retains high performance on low-level measures
  • Brain-Diffuser with all components is optimal for both low-level and high-level measures
  • Brain-Diffuser can learn eccentricity-based retinotopic organization of the cortex
  • Quantitative results show that full Brain-Diffuser model is best for all measures