Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

DNNs can accurately capture the hierarchy of neural responses in the mammalian visual system.
DNNs have been less successful in explaining representations in higher cortical areas.
A novel scene perception benchmark has been designed to probe the ability of DNNs to transform scenes viewed from different perspectives.
A network architecture inspired by the connectivity between temporal lobe structures and the hippocampus has been used to demonstrate that DNNs can learn this task.
A factorized latent space has been used to split information propagation into “what” and “where” pathways.
This allowed the model to surpass the state of the art for unsupervised object segmentation on the CATER and MOVi-A,B,C benchmarks.

Paper Content

Introduction

Neural networks can produce coherent scene understanding and synthesize novel views
Structures from the visual cortex to the hippocampal formation govern the transformation from egocentric to allocentric reference frames
Hippocampus is necessary for memory and perception of places and events
Computational models needed to explain single-cell responses across transformation circuit
Developed a scene recognition model to understand the transformation from egocentric to allocentric reference frames
Developed a novel hippocampally dependent task inspired by 4-Mountains-Test
Model architecture inspired by recent work in scene perception
Model trained using triplet loss to distinguish between scenes across different viewpoints
Reconstruction loss combined with pixel-wise decoder used for unsupervised object segmentation
Study of scene perception in neuroscience first explored in behavioural experiments in 1972

Related work in computer science

Scene perception models have been used to tackle the problem of novel view synthesis.
SLAM algorithms have been used in robotics to represent scenes and navigate within them.
Neural radiance field networks try to mimic the image synthesis based on real-world physics.
Traditional scene decomposition methods use a latent space to produce new views.

Task design

A model was built to deal with randomly sampled egocentric sensory observations.
The model was inspired by novel view synthesis tasks and the 4-Mountains-Test.
The 4-Mountains-Test is used in the clinic and is sensitive to hippocampal damage.
The task was simplified by rendering four objects with circular symmetry and a global reference frame.

Model architecture

Model architecture follows known biological connectivity between visual cortices and hippocampal formation
Pre-trained convolutional neural network used to simulate responses of visual cortex
Activations from V4 and IT cortex extracted for each rendered viewpoint
Information from V4 and IT routed through PR and PH cortices using weak or strong connections
MEC and LEC entorhinal cortex receive input averaged across time and space respectively
Hippocampus consists of CA3 and CA1 layers
CA1 layer split into temporal and spatial information
Pixel-wise decoder used to reconstruct input across novel views
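
The averaging step for the entorhinal inputs can be illustrated with a minimal sketch. It assumes, purely for illustration, that the features extracted from the pre-trained network form an array of shape (timesteps, height, width, channels). The function name `entorhinal_inputs` and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def entorhinal_inputs(features):
    """Split a feature sequence into time-averaged and space-averaged summaries.

    features: array of shape (T, H, W, C); this layout is an assumption.
    Returns (mec_input, lec_input):
      mec_input has shape (H, W, C), averaged across time;
      lec_input has shape (T, C), averaged across space.
    """
    features = np.asarray(features, dtype=float)
    if features.ndim != 4:
        raise ValueError("expected features of shape (T, H, W, C)")
    mec_input = features.mean(axis=0)       # average over timesteps
    lec_input = features.mean(axis=(1, 2))  # average over height and width
    return mec_input, lec_input

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4, 4, 8))  # 5 timesteps, 4x4 spatial grid, 8 channels
mec, lec = entorhinal_inputs(feats)
print(mec.shape, lec.shape)  # (4, 4, 8) (5, 8)
```

Averaging over time keeps the spatial layout and discards frame order, while averaging over space keeps one feature vector per timestep. This mirrors the "where" and "what" split described in the abstract.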

Model optimization

Model receives egocentric sensory observations which are transformed into allocentric representations in the hippocampal formation
Model is trained by minimizing either a triplet loss on the hippocampal latent space or an L2-reconstruction loss in pixel space
L2-reconstruction loss is calculated between the predicted reconstruction and the original image
Final loss is summed across the width, height and timesteps and averaged across batches
Predictive objective function is similar to the reconstruction loss as the latent space enforces the reconstruction of scenes from different viewpoints
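
The two objectives above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the triplet loss uses the standard margin formulation with Euclidean distance, and the margin value, function names, and the (batch, timesteps, height, width) tensor layout are all assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on latent vectors of shape (B, D).

    anchor and positive are assumed to encode the same scene from different
    viewpoints; negative encodes a different scene. The margin is hypothetical.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

def l2_reconstruction_loss(pred, target):
    """Squared error summed over timesteps, height and width, averaged over batch.

    pred, target: arrays of shape (B, T, H, W); this layout is an assumption.
    """
    sq_err = (np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)) ** 2
    per_sample = sq_err.sum(axis=(1, 2, 3))  # sum across T, H, W
    return float(per_sample.mean())          # average across the batch

anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])   # distance 1 from the anchor
negative = np.array([[0.0, 1.5]])   # distance 1.5 from the anchor
print(triplet_loss(anchor, positive, negative))  # 0.5

pred = np.zeros((2, 3, 2, 2))
target = np.ones((2, 3, 2, 2))
print(l2_reconstruction_loss(pred, target))  # 12.0
```

The triplet loss reaches zero once the different-scene embedding is at least one margin farther from the anchor than the same-scene embedding, which is what pushes scenes apart in the latent space.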

Results

Performance on adapted 4-Mountains-Test

Model trained to discriminate between different scenes
Triplet loss used to disentangle hippocampal representation between different scenes
Performance evaluated by calculating cross-correlation matrix
Pre-trained layers and image itself show performance levels around chance
IT performs best at 39%
Late layers in model construct world-centred representation of environment
CA3 and CA1 show performance close to 90%
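
The summary does not spell out how the cross-correlation matrix is turned into a performance score. One plausible reading, given here only as an assumption, is to correlate each layer's representation of a scene with representations of candidate scenes seen from another viewpoint and count how often the correct scene has the highest correlation. The function name `match_accuracy` and the data layout are hypothetical.

```python
import numpy as np

def match_accuracy(queries, candidates):
    """Score scene matching from a Pearson cross-correlation matrix.

    queries, candidates: arrays of shape (N, D). Row i of candidates is
    assumed to show the same scene as row i of queries from another viewpoint.
    Rows must not be constant, otherwise the normalization divides by zero.
    Returns (accuracy, corr) where corr has shape (N, N).
    """
    q = np.asarray(queries, dtype=float)
    c = np.asarray(candidates, dtype=float)
    q = q - q.mean(axis=1, keepdims=True)   # centre each representation
    c = c - c.mean(axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    corr = q @ c.T                          # Pearson correlation of every pair
    predicted = corr.argmax(axis=1)         # best-matching candidate per query
    accuracy = float((predicted == np.arange(len(q))).mean())
    return accuracy, corr

queries = np.array([[1.0, 0.0, 0.0, 2.0],
                    [0.0, 3.0, 1.0, 0.0],
                    [2.0, 2.0, 0.0, 1.0]])
candidates = 2.0 * queries + 1.0  # toy "other viewpoint": same pattern, rescaled
acc, corr = match_accuracy(queries, candidates)
print(acc)  # 1.0
```

Under this reading, a layer whose representations change with viewpoint would score near chance, while a viewpoint-invariant layer would score high, consistent with the contrast reported between the pre-trained layers and CA3/CA1.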

Neural representations within network layers

Allocentricity measure is used to quantify amount of allocentric information in each trained layer
Hippocampal layers have highest allocentricity score
Later layers incorporate more high-level information
Allocentric position and scene identity can be read out from hippocampal layer CA3
Egocentric information is reduced in layers beyond MEC
CA1 layer retains allocentric information
Visualized activity of neurons in different reference frames shows allocentric boundary and place-like activity

Reconstructing the input through feedback connections

Model is able to discern between novel scenes from arbitrary viewpoints
Added additional reconstruction loss in pixel space
Used factorized latent space to sample objects and frame information for each individual pixel and time point
Mental imagination is guided by viewpoint-changing signal combined with object information
Reconstruction loss varies depending on object colours used
Triplet loss helps to differentiate between scenes in late layers of the model

Model performance on CATER and MOVi

Evaluated neural network architecture on CATER benchmark
Compared FG-ARI across different unsupervised models
MONet is a static-frame model
SAVi uses optical flow as a supervision signal
Model performance obtained by training for 200,000 steps with a fixed learning rate
Model architecture for higher-level cortices replaced with four convolutional layers
Model able to reconstruct input frames and segment objects on CATER dataset
Model performance comparable or better than baseline models on CATER
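
FG-ARI is the Adjusted Rand Index computed only over pixels that belong to foreground objects in the ground truth. The sketch below shows how the metric is commonly computed and is not the evaluation code used in the paper. The function name `fg_ari` and the convention that label 0 marks the background are assumptions.

```python
import numpy as np

def _comb2(x):
    """Number of unordered pairs, n choose 2, applied elementwise."""
    return x * (x - 1) / 2.0

def fg_ari(true_mask, pred_mask, background=0):
    """Adjusted Rand Index restricted to ground-truth foreground pixels.

    true_mask, pred_mask: integer label arrays of the same shape.
    background: label of the ground-truth background (assumed to be 0).
    """
    t = np.asarray(true_mask).ravel()
    p = np.asarray(pred_mask).ravel()
    fg = t != background
    t, p = t[fg], p[fg]
    n = t.size
    if n < 2:
        raise ValueError("need at least two foreground pixels")
    # Contingency table between true and predicted segments.
    _, t_idx = np.unique(t, return_inverse=True)
    _, p_idx = np.unique(p, return_inverse=True)
    table = np.zeros((t_idx.max() + 1, p_idx.max() + 1))
    np.add.at(table, (t_idx, p_idx), 1)
    index = _comb2(table).sum()
    sum_rows = _comb2(table.sum(axis=1)).sum()
    sum_cols = _comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / _comb2(n)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single segment in both masks
    return float((index - expected) / (max_index - expected))

true_mask = np.array([[0, 1, 1],
                      [0, 2, 2]])
pred_mask = np.array([[5, 7, 7],
                      [5, 9, 9]])  # same grouping, different label ids
print(fg_ari(true_mask, pred_mask))  # 1.0
```

Because the score depends only on how pixels are grouped, label ids do not need to match between prediction and ground truth, and predictions on background pixels are ignored.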

Discussion

Artificial neural network was trained to perform allocentric topographical processing
Model uses visual representations from V4 and IT and higher-level areas such as the entorhinal cortex and hippocampus
Model can discern between hundreds of scenes and generalize beyond its training set
Model can reconstruct visual input and disentangle object from spatial information
Model can perform novel view synthesis and segment objects
Model size is relatively small and may not perform well on more challenging datasets
Future work: explore how neural representations differ across the objective functions used
