Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- DNNs can accurately capture the hierarchy of neural responses in the mammalian visual system.
- DNNs have been less successful in explaining representations in higher cortical areas.
- A novel scene perception benchmark has been designed to probe the ability of DNNs to transform scenes viewed from different perspectives.
- A network architecture inspired by the connectivity between temporal lobe structures and the hippocampus has been used to demonstrate that DNNs can learn this task.
- A factorized latent space has been used to split information propagation into “what” and “where” pathways.
- This has allowed for the state-of-the-art for unsupervised object segmentation on the CATER and MOVi-A,B,C benchmarks to be beaten.
Paper Content
Introduction
- Neural networks can produce coherent scene understanding and synthesize novel views
- Structures along the visual cortex to the hippocampal formation govern transformation from egocentric to allocentric
- Hippocampus is necessary for memory and perception of places and events
- Computational models needed to explain single-cell responses across transformation circuit
- Developed a scene recognition model to understand transformation from egocentric to allocentric
- Developed a novel hippocampally dependent task inspired by 4-Mountains-Test
- Model architecture inspired by recent work in scene perception
- Model trained using triplet loss to distinguish between scenes across different viewpoints
- Reconstruction loss combined with pixel-wise decoder used for unsupervised object segmentation
- Study of scene perception in neuroscience first explored in behavioural experiments in 1972
Related work in computer science
- Scene perception models have been used to tackle the problem of novel view synthesis.
- SLAM algorithms have been used in robotics to represent scenes and navigate within them.
- Neural radiance field networks try to mimic the image synthesis based on real-world physics.
- Traditional scene decomposition methods use a latent space to produce new views.
Task design
- A model was built to deal with randomly sampled egocentric sensory observations.
- The model was inspired by novel view synthesis tasks and the 4-Mountains-Test.
- The 4-Mountains-Test is used in the clinic and is sensitive to hippocampal damage.
- The task was simplified by rendering four objects with circular symmetry and a global reference frame.
Model architecture
- Model architecture follows known biological connectivity between visual cortices and hippocampal formation
- Pre-trained convolutional neural network used to simulate responses of visual cortex
- Activations from V4 and IT cortex extracted for each rendered viewpoint
- Information from V4 and IT routed through PR and PH cortices using weak or strong connections
- MEC and LEC entorhinal cortex receive input averaged across time and space respectively
- Hippocampus consists of CA3 and CA1 layers
- CA1 layer split into temporal and spatial information
- Pixel-wise decoder used to reconstruct input across novel views
Model optimization
- Model receives egocentric sensory observations which are transformed into allocentric representations in the hippocampal formation
- Model is trained by minimizing either a triplet loss on the hippocampal latent space or an L2-reconstruction loss in pixel space
- L2-reconstruction loss is calculated between the predicted reconstruction and the original image
- Final loss is summed across the width, height and timesteps and averaged across batches
- Predictive objective function is similar to the reconstruction loss as the latent space enforces the reconstruction of scenes from different viewpoints
Results
Performance on adapted 4-mountains-test
- Model trained to separate between different scenes
- Triplet loss used to disentangle hippocampal representation between different scenes
- Performance evaluated by calculating cross-correlation matrix
- Pre-trained layers and image itself show performance levels around chance
- IT performs best at 39%
- Late layers in model construct world-centred representation of environment
- CA3 and CA1 show performance close to 90%
Neural representations within network layers
- Allocentricity measure is used to quantify amount of allocentric information in each trained layer
- Hippocampal layers have highest allocentricity score
- Later layers incorporate more high-level information
- Allocentric position and scene identity can be read out from hippocampal layer CA3
- Egocentric information is reduced in layers beyond MEC
- CA1 layer retains allocentric information
- Visualized activity of neurons in different reference frames shows allocentric boundary and place-like activity
Reconstructing the input through feedback connections
- Model is able to discern between novel scenes from arbitrary viewpoints
- Added additional reconstruction loss in pixel space
- Used factorized latent space to sample objects and frame information for each individual pixel and time point
- Mental imagination is guided by viewpoint-changing signal combined with object information
- Reconstruction loss varies depending on object colours used
- Triplet loss helps to differentiate between scenes in late layers of the model
Model performance on cater and movi
- Evaluated neural network architecture on CATER benchmark
- Compared FG-ARI across different unsupervised models
- MONet is a staticframe model
- SAVi uses optical flow as a supervision signal
- Model performance obtained by training for 200000 steps with fixed learning rate
- Model architecture for higher-level cortices replaced with four convolutional layers
- Model able to reconstruct input frames and segment objects on CATER dataset
- Model performance comparable or better than baseline models on CATER
Discussion
- Artificial neural network was trained to perform allocentric topographical processing
- Model uses visual representations from V4 and IT and higher-level areas like the entorhinal and hippocampus
- Model can discern between hundreds of scenes and generalize beyond its training set
- Model can reconstruct visual input and disentangle object from spatial information
- Model can perform novel view synthesis and segment objects
- Model size is relatively small and may not perform well on more challenging datasets
- Future exploration of neural representations across objective functions used