Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Investigating how to use generative AI for neural decoding
- Two-stage scene reconstruction framework called “Brain-Diffuser”
- First stage reconstructs images with low-level properties and overall layout
- Second stage uses image-to-image framework of latent diffusion model
- Outperforms previous models on Natural Scenes Dataset benchmark
- Creates “ROI-optimal” scenes consistent with neuroscientific knowledge
Paper Content
Introduction
- Establishing neural encoding and decoding techniques is one way to discover how the brain and cognition work
- Recent developments in modeling and computation have opened up new ways of decoding information from brain signals
- Studies have used statistical techniques and machine learning to decode information from fMRI
- Deep generative models have been used to reconstruct entire images
- Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Latent Diffusion Models (LDM) have been used
- Images with different levels of complexity have been reconstructed, such as faces, single-object-centered images, and complex scenes
- Two datasets have been used for natural scene reconstruction: Generic Object Decoding and Deep Image Reconstruction
- Brain-Diffuser model proposed to generate scene images with high fidelity
- Brain-Diffuser model uses two stages and is conditioned on both vision and language representations
- Brain-Diffuser model demonstrates superior performance compared to earlier models
Materials and methods
Dataset
- Used publicly available Natural Scenes Dataset (NSD)
- Collected from 8 subjects viewing images from COCO dataset
- Used 4 subjects who completed all trials
- Training set contained 8859 images and 24980 fMRI trials
- Test set contained 982 images and 2770 fMRI trials
- Averaged fMRI trials for images with multiple repetitions
- Used corresponding captions from COCO dataset
- Used Generalized linear models to estimate preprocessed single-trial beta weights
- Masked preprocessed fMRI signals using NSDGeneral ROI mask
Low-level reconstruction of images using vdvae (first stage)
- VAE is a generative model used to capture an input distribution
- VDVAE is a hierarchical VAE model with several layers of conditionally dependent latent variables
- Equations 1 and 2 show the hierarchical dependence of the latent variables
- VDVAE is trained on a 64x64 resolution ImageNet dataset with 75 layers
- Latent variables from the first 31 layers are used for regression
- Ridge regression model is trained between fMRI training patterns and concatenated latent variables
- Test fMRI patterns are provided to the trained regression model to predict latent values
- Latent values are fed to the decoder part of the VDVAE to obtain reconstructed images
Final reconstruction of images using versatile diffusion (second stage)
- Used VDVAE to reconstruct image layout
- Used Versatile Diffusion 28 model in second stage of reconstruction framework
- Versatile Diffusion is a latent diffusion model
- Autoencoder trained on large-scale image dataset to learn compressed representation of images
- Forward diffusion process applied to latent variables by adding Gaussian noise
- Reverse diffusion process learned via neural network to predict and remove noise
- Versatile Diffusion model allows for conditioning on text captions, images, and semantic maps
- Versatile Diffusion model trained on Laion2B-en 32 dataset
- CLIP network based on transformer architecture
- Two regression models trained between fMRI patterns and CLIP-Vision/Text features
- Image-to-image pipeline of latent diffusion model used at testing time
- CLIP-Vision and CLIP-Text used jointly in double-guided diffusion pipeline
Code availability
- Code for project is publicly available
- Examples of reconstructions in Figure 3
- Results from individual subjects and average of all subjects
- Reconstructed images capture layout and semantics of groundtruth images
- Differences in pixel-level details remain
- Reconstructed images are naturalistic and plausible alternate renditions of ground truth
Results and analyses
Comparison with state of the art
- Lin et al. used StyleGAN2 model and image/text features
- Lin et al. performed better than other two models in some instances
- Takagi et al. used latent diffusion model, but not as good as our model
- Gu et al. used Instance-Conditioned GAN model trained on ImageNet
- Our model better at reconstructing shape and texture details
- Our model better at reconstructing complex scenes with multiple objects
Quantitative results
- 8 different image quality metrics are used to compare models
- 4 of the metrics are low-level, 4 are high-level
- Images are preprocessed according to the input properties of each network
- Our model is the best-performing model by a decent margin for all of the quantitative metrics
- Ablations of the main model show that each component provides reliable improvements
Roi analysis
- Method can be used to understand functional properties of brain regions
- Early studies in neuroscience literature identified visual properties that activate neurons in each brain region
- Method can be adapted to visualize “ROI-optimal” images
- Synthetic fMRI patterns created by setting voxels of a specific ROI to ones and the rest to zeros
- Generated images confirm decades of knowledge from neuroscience literature
- Technique can be extended to study retinotopic or eccentricity-based cortical organization
Discussion
- Designed two-stage framework (Brain-Diffuser) to reconstruct images from fMRI patterns
- Used VDVAE model to generate “initial guess” reconstructions
- Used image-to-image pipeline of Versatile Diffusion model to generate final reconstructions
- Results showed reconstructed images preserve most of layout and semantic information
- Results showed Brain-Diffuser outperforms previous models in both high-level and low-level metrics
- Future work could design novel experiments and analyses on NSD dataset using generative models
Supplementary
- VDVAE model provides best results for low-level measures, worst for high-level measures
- Brain-Diffuser without VDVAE performs worst on low-level measures, best on high-level measures
- Full Brain-Diffuser model is best for both low-level and high-level measures
- Brain-Diffuser without CLIP-Text has a sizeable decrement in both low-level and high-level measures
- Brain-Diffuser without CLIP-Vision retains high performance on low-level measures
- Brain-Diffuser with all components is optimal for both low-level and high-level measures
- Brain-Diffuser can learn eccentricity-based retinotopic organization of the cortex
- Quantitative results show that full Brain-Diffuser model is best for all measures