Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Investigating how to use generative AI for neural decoding
Two-stage scene reconstruction framework called “Brain-Diffuser”
First stage reconstructs images with low-level properties and overall layout
Second stage uses image-to-image framework of latent diffusion model
Outperforms previous models on Natural Scenes Dataset benchmark
Creates “ROI-optimal” scenes consistent with neuroscientific knowledge

Paper Content

Introduction

Establishing neural encoding and decoding techniques is one way to discover how the brain and cognition work
Recent developments in modeling and computation have opened up new ways of decoding information from brain signals
Studies have used statistical techniques and machine learning to decode information from fMRI
Deep generative models have been used to reconstruct entire images
Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Latent Diffusion Models (LDM) have been used
Images with different levels of complexity have been reconstructed, such as faces, single-object-centered images, and complex scenes
Two datasets have been used for natural scene reconstruction: Generic Object Decoding and Deep Image Reconstruction
Brain-Diffuser model proposed to generate scene images with high fidelity
Brain-Diffuser model uses two stages and is conditioned on both vision and language representations
Brain-Diffuser model demonstrates superior performance compared to earlier models

Materials and methods

Dataset

Used publicly available Natural Scenes Dataset (NSD)
Collected from 8 subjects viewing images from COCO dataset
Used 4 subjects who completed all trials
Training set contained 8859 images and 24980 fMRI trials
Test set contained 982 images and 2770 fMRI trials
Averaged fMRI trials for images with multiple repetitions
Used corresponding captions from COCO dataset
Used Generalized linear models to estimate preprocessed single-trial beta weights
Masked preprocessed fMRI signals using NSDGeneral ROI mask

Low-level reconstruction of images using vdvae (first stage)

VAE is a generative model used to capture an input distribution
VDVAE is a hierarchical VAE model with several layers of conditionally dependent latent variables
Equations 1 and 2 show the hierarchical dependence of the latent variables
VDVAE is trained on a 64x64 resolution ImageNet dataset with 75 layers
Latent variables from the first 31 layers are used for regression
Ridge regression model is trained between fMRI training patterns and concatenated latent variables
Test fMRI patterns are provided to the trained regression model to predict latent values
Latent values are fed to the decoder part of the VDVAE to obtain reconstructed images

Final reconstruction of images using versatile diffusion (second stage)

Used VDVAE to reconstruct image layout
Used Versatile Diffusion 28 model in second stage of reconstruction framework
Versatile Diffusion is a latent diffusion model
Autoencoder trained on large-scale image dataset to learn compressed representation of images
Forward diffusion process applied to latent variables by adding Gaussian noise
Reverse diffusion process learned via neural network to predict and remove noise
Versatile Diffusion model allows for conditioning on text captions, images, and semantic maps
Versatile Diffusion model trained on Laion2B-en 32 dataset
CLIP network based on transformer architecture
Two regression models trained between fMRI patterns and CLIP-Vision/Text features
Image-to-image pipeline of latent diffusion model used at testing time
CLIP-Vision and CLIP-Text used jointly in double-guided diffusion pipeline

Code availability

Code for project is publicly available
Examples of reconstructions in Figure 3
Results from individual subjects and average of all subjects
Reconstructed images capture layout and semantics of groundtruth images
Differences in pixel-level details remain
Reconstructed images are naturalistic and plausible alternate renditions of ground truth

Results and analyses

Comparison with state of the art

Lin et al. used StyleGAN2 model and image/text features
Lin et al. performed better than other two models in some instances
Takagi et al. used latent diffusion model, but not as good as our model
Gu et al. used Instance-Conditioned GAN model trained on ImageNet
Our model better at reconstructing shape and texture details
Our model better at reconstructing complex scenes with multiple objects

Quantitative results

8 different image quality metrics are used to compare models
4 of the metrics are low-level, 4 are high-level
Images are preprocessed according to the input properties of each network
Our model is the best-performing model by a decent margin for all of the quantitative metrics
Ablations of the main model show that each component provides reliable improvements

Roi analysis

Method can be used to understand functional properties of brain regions
Early studies in neuroscience literature identified visual properties that activate neurons in each brain region
Method can be adapted to visualize “ROI-optimal” images
Synthetic fMRI patterns created by setting voxels of a specific ROI to ones and the rest to zeros
Generated images confirm decades of knowledge from neuroscience literature
Technique can be extended to study retinotopic or eccentricity-based cortical organization

Discussion

Designed two-stage framework (Brain-Diffuser) to reconstruct images from fMRI patterns
Used VDVAE model to generate “initial guess” reconstructions
Used image-to-image pipeline of Versatile Diffusion model to generate final reconstructions
Results showed reconstructed images preserve most of layout and semantic information
Results showed Brain-Diffuser outperforms previous models in both high-level and low-level metrics
Future work could design novel experiments and analyses on NSD dataset using generative models

Supplementary

VDVAE model provides best results for low-level measures, worst for high-level measures
Brain-Diffuser without VDVAE performs worst on low-level measures, best on high-level measures
Full Brain-Diffuser model is best for both low-level and high-level measures
Brain-Diffuser without CLIP-Text has a sizeable decrement in both low-level and high-level measures
Brain-Diffuser without CLIP-Vision retains high performance on low-level measures
Brain-Diffuser with all components is optimal for both low-level and high-level measures
Brain-Diffuser can learn eccentricity-based retinotopic organization of the cortex
Quantitative results show that full Brain-Diffuser model is best for all measures

Link to paper#

Abstract#

Paper Content#

Introduction#

Materials and methods#

Dataset#

Low-level reconstruction of images using vdvae (first stage)#

Final reconstruction of images using versatile diffusion (second stage)#

Code availability#

Results and analyses#

Comparison with state of the art#

Quantitative results#

Roi analysis#

Discussion#

Supplementary#

Link to paper

Abstract

Paper Content

Introduction

Materials and methods

Dataset

Low-level reconstruction of images using vdvae (first stage)

Final reconstruction of images using versatile diffusion (second stage)

Code availability

Results and analyses

Comparison with state of the art

Quantitative results

Roi analysis

Discussion

Supplementary