Abstract
- Humans have neurophysiological mechanisms that enable efficient continual learning.
- The brain encodes information in non-overlapping sparse codes, which facilitates learning new associations.
- To mimic sparse coding in DNNs, the proposed method combines activation sparsity with a dropout mechanism.
- A multiple-memory replay mechanism maintains a long-term semantic memory.
Paper Content
Introduction
- Standard DNNs are not designed for lifelong learning and exhibit catastrophic forgetting of previously learned knowledge.
- The core challenge in continual learning (CL) in DNNs is to maintain an optimal balance between plasticity and the stability of the model.
- Rehearsal-based methods have proven to be an effective approach in challenging CL tasks.
- The human brain provides an existence proof for successful CL in complex dynamic environments.
- The human brain maintains a delicate balance between stability and plasticity through a complex set of neurophysiological mechanisms.
- Sparse Coding and Complementary Learning Systems (CLS) theory can provide insight into the design principles and mechanisms that can enable CL in DNNs.
- SCoMMER uses a multi-memory experience replay mechanism that encourages sparse coding in DNNs and mimics the interplay of multiple memory systems.
Related work
- Three categories of approaches to address catastrophic forgetting in CL: regularization-based, dynamic architecture, and rehearsal-based
- Rehearsal-based methods have proven to be effective in challenging CL scenarios
- Different aspects of rehearsal have been studied, such as memory sample selection, sample retrieval, and what information to extract and replay
- CLS theory has inspired approaches that utilize multiple memory systems
Methodology
- Overview of motivation from biological systems
- Introduction to components of proposed approach
Continual learning in the biological system
- The brain uses sparse coding to represent information
- Sparse codes provide advantages for learning new associations
- The brain has specialized regions and multiple memory systems
- Complementary Learning Systems (CLS) theory posits that efficient learning requires two complementary systems
- The method aims to incorporate these components of the brain into ANNs
Sparse coding in DNNs
- Brain has sparse neural codes, while DNNs have dense connections and overlapping representations
- Sparse representations can reduce interference between tasks and lead to less forgetting
- Activation sparsity can lead to natural emergence of modules
- To mimic sparse coding in DNNs, activation sparsity and semantic dropout are employed
- A k-winner-take-all (k-WTA) activation function is used
- An activation score is assigned to each filter in the layer, and only the highest-scoring filters stay active
- A sparsity ratio is set for each layer
- Semantic dropout encourages the model to activate similar units for semantically similar samples
- The probability of retaining a unit for a given class is proportional to the number of times it has been activated for that class
- Together, activation sparsity and semantic dropout provide an efficient mechanism for balancing the reusability and interference of features
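The k-WTA step described above can be illustrated with a minimal sketch. It uses plain Python lists in place of tensors. The scoring rule (sum of absolute activations per filter), the name `kwta_filters`, and the reading of the per-layer ratio as the fraction of filters kept (`keep_ratio`) are assumptions made for illustration, not details confirmed by this summary.

```python
def kwta_filters(feature_maps, keep_ratio):
    """Keep the k highest-scoring filters of one layer and zero out the rest.

    feature_maps: list of filters, each a 2-D list (H x W) of activations.
    keep_ratio:   assumed meaning of the per-layer ratio, i.e. the fraction
                  of filters that stay active.
    """
    num_filters = len(feature_maps)
    k = max(1, round(keep_ratio * num_filters))
    # Assumed activation score: sum of absolute activations over all positions.
    scores = [sum(abs(v) for row in fmap for v in row) for fmap in feature_maps]
    ranked = sorted(range(num_filters), key=lambda i: scores[i], reverse=True)
    winners = set(ranked[:k])
    # Winners pass through unchanged; every other filter is silenced entirely.
    return [
        [row[:] for row in fmap] if i in winners
        else [[0.0] * len(row) for row in fmap]
        for i, fmap in enumerate(feature_maps)
    ]


maps = [[[1.0, -1.0]], [[0.1, 0.1]], [[3.0, 0.0]], [[0.0, 0.5]]]
sparse_maps = kwta_filters(maps, keep_ratio=0.5)
```

With four filters and `keep_ratio=0.5`, only the two filters with the largest scores (the first and the third) remain non-zero.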
Multiple memory systems
- Episodic Memory: Fixed-size buffer that stores samples for replay
- Long-Term Memory: Aggregates weights of working memory over time
- Long-Term Memory: Builds structural representations for generalization
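One common way to aggregate the weights of a working model over time is an exponential moving average, optionally applied only at a random subset of training steps. The sketch below assumes that scheme; `decay`, `update_prob`, and the function name are hypothetical, and the summary itself does not specify the exact update rule.

```python
import random


def update_long_term_memory(long_term, working, decay=0.999,
                            update_prob=1.0, rng=random):
    """Blend working-memory weights into long-term memory, in place.

    long_term, working: dicts mapping parameter names to lists of floats.
    decay, update_prob: hypothetical hyperparameters (assumed EMA scheme).
    Returns True if an update was applied, False if the step was skipped.
    """
    if rng.random() >= update_prob:
        return False  # skipped: long-term memory stays unchanged
    for name, w_values in working.items():
        long_term[name] = [
            decay * lt + (1.0 - decay) * w
            for lt, w in zip(long_term[name], w_values)
        ]
    return True


long_term = {"w": [0.0, 1.0]}
working = {"w": [1.0, 1.0]}
update_long_term_memory(long_term, working, decay=0.9)
```

A decay close to 1 makes the long-term memory change slowly, which matches the idea of consolidating knowledge gradually rather than tracking every step of the working memory.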
Overall formulation
- CL task is to learn joint distribution of observed tasks without task labels at test time
- SCoMMER involves training a working memory and maintaining an additional long-term memory and episodic memory
- Long-term memory is initialized with same parameters as working memory and has same sparsity constraints
- Heterogeneous dropout probabilities are randomly initialized
- Overall loss is combination of cross-entropy and knowledge retrieval loss
- Long-term memory is used for inference
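The combined objective above can be sketched as follows. The form of the knowledge retrieval term (mean squared error between working-memory and long-term-memory logits) and the weight `gamma` are assumptions for illustration; the summary only states that the two losses are combined.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def knowledge_retrieval_loss(working_logits, long_term_logits):
    """Assumed form: mean squared error between the two sets of logits."""
    diffs = [(a - b) ** 2 for a, b in zip(working_logits, long_term_logits)]
    return sum(diffs) / len(diffs)


def overall_loss(batch, gamma=0.1):
    """batch: list of (working_logits, long_term_logits, label) triples.

    gamma is a hypothetical weight on the knowledge retrieval term.
    """
    ce = sum(cross_entropy(w, y) for w, _, y in batch) / len(batch)
    kr = sum(knowledge_retrieval_loss(w, lt) for w, lt, _ in batch) / len(batch)
    return ce + gamma * kr


loss = overall_loss([([0.0, 0.0], [2.0, 0.0], 1)], gamma=0.5)
```

Here the cross-entropy term fits the labels, while the second term pulls the working memory's outputs toward those of the long-term memory.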
Evaluation protocol
- SCoMMER is a model designed to tackle the challenges of lifelong learning
- Class-IL presents a challenging CL scenario where the model must learn to distinguish between all the classes seen so far without task labels
- GCIL extends the Class-IL setting to more realistic scenarios with class imbalance and varying number of classes
- Task-IL setting uses task labels to select the subset of output logits
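The difference between Class-IL and Task-IL prediction comes down to which logits compete. A minimal sketch, where the function name and the example class split are illustrative only:

```python
def predict(logits, task_classes=None):
    """Return the predicted class index.

    Class-IL: task_classes is None, so every class seen so far competes.
    Task-IL:  task_classes lists the class indices of the known task, and
              the argmax is restricted to that subset of logits.
    """
    candidates = range(len(logits)) if task_classes is None else task_classes
    return max(candidates, key=lambda c: logits[c])


logits = [0.1, 0.5, 2.0, 1.5]
class_il_prediction = predict(logits)                      # all four classes
task_il_prediction = predict(logits, task_classes=[0, 1])  # first task only
```

For a sample from the first task, the Class-IL prediction is pulled toward class 2 from a later task, while the Task-IL prediction stays within classes 0 and 1. This is why Class-IL is the harder setting.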
Empirical evaluation
- SCoMMER is compared to state-of-the-art rehearsal-based methods
- SCoMMER provides performance gains in the majority of cases
- SCoMMER provides considerable improvement under low buffer size settings
- SCoMMER enables effective utilization of a single semantic memory
- SCoMMER provides advantages in GCIL setting
- SCoMMER enables reuse of sparse code associated with recurring objects
- SCoMMER leads to emergence of subnetworks that provide modularity and protection
Ablation study
- Each component of SCoMMER contributes to performance gains
- Semantic dropout enforces sparse coding and reduces interference between tasks
- Multiple memory systems provide considerable performance improvement
- Sparsity is critical for enabling CL in DNNs
- Sparse activations improve ER performance and enable efficient utilization of semantic memory
Characteristics analysis
- Analyzed models trained on S-CIFAR100
- Buffer size of 200 used
Stability-plasticity dilemma
- SCoMMER is able to maintain a better balance between stability and plasticity than other methods.
- SCoMMER provides more uniform performance on tasks compared to baselines.
- SCoMMER retains performance on earlier tasks and provides good performance on recent tasks.
- Long-term memory aggregates learned knowledge into synaptic weights and generalizes well across tasks.
Emergence of subnetworks
- Average activity of units in penultimate layer is tracked during training
- High correlation between test set activation counts and semantic dropout probabilities
- Activation counts hint at emergence of semantically conditioned subnets
- Semantically similar classes have higher degree of correlation between their activation patterns
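The analysis above can be approximated with per-class activation counts and a correlation measure between classes. This sketch assumes that a unit counts as active when its output is non-zero and uses Pearson correlation; both choices, and the function names, are assumptions made for illustration.

```python
import math


def activation_counts(activations, labels, num_classes):
    """counts[c][j] = number of class-c samples for which unit j is active."""
    num_units = len(activations[0])
    counts = [[0] * num_units for _ in range(num_classes)]
    for sample, label in zip(activations, labels):
        for j, value in enumerate(sample):
            if value != 0:  # assumed definition of "active"
                counts[label][j] += 1
    return counts


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


# Toy penultimate-layer activations: classes 0 and 2 share units, class 1 does not.
acts = [[1, 0, 2, 0], [3, 0, 0, 0], [0, 1, 0, 4], [1, 0, 1, 0]]
counts = activation_counts(acts, labels=[0, 0, 1, 2], num_classes=3)
similar = pearson(counts[0], counts[2])
dissimilar = pearson(counts[0], counts[1])
```

In this toy data, classes 0 and 2 use overlapping units and correlate positively, while classes 0 and 1 use disjoint units and correlate negatively.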
Task recency bias
- Recency bias leads to forgetting of earlier tasks
- SCoMMER provides more uniform probabilities to predict each task
- CLS-ER mitigates bias towards the last task but reduces probability of predicting the last task
- SCoMMER effectively mitigates recency bias
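One way to quantify recency bias is to average the softmax outputs over the test set and sum the probability mass assigned to each task's classes. The sketch below follows that assumed procedure; it is not necessarily the exact measurement used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax for a single sample."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def task_prediction_probabilities(all_logits, classes_per_task):
    """Average softmax over samples, then sum the mass of each task's classes."""
    num_classes = len(all_logits[0])
    mean_probs = [0.0] * num_classes
    for logits in all_logits:
        for c, p in enumerate(softmax(logits)):
            mean_probs[c] += p / len(all_logits)
    return [sum(mean_probs[c] for c in classes) for classes in classes_per_task]


tasks = [[0, 1], [2, 3]]
balanced = task_prediction_probabilities([[0.0, 0.0, 0.0, 0.0]], tasks)
biased = task_prediction_probabilities([[0.0, 0.0, 5.0, 5.0]], tasks)
```

A model without recency bias yields roughly equal values across tasks, as in `balanced`; a biased model concentrates the mass on the most recent task, as in `biased`.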
Conclusion
- Proposed a novel approach to employ sparse coding in multiple memory systems
- Enforces activation sparsity and a complementary semantic dropout mechanism
- Maintains long-term memory to consolidate learned knowledge
- Evaluation shows effectiveness of approach in mitigating forgetting
- Sparse coding enables efficient consolidation of knowledge in long-term memory
- Reduces bias towards recent tasks and leads to emergence of semantically conditioned subnetworks
- Use same network, optimizer, batch size, data augmentations, and number of epochs for all experiments
- Hyperparameter tuning uses small held-out validation set and grid search
- Consider Class-IL and Generalized Class-IL settings for empirical evaluation
- Implementation details are provided for the k-WTA activation for CNNs and the semantic dropout mechanism
- Initialize heterogeneous dropout probabilities to ensure learning of first task does not utilize all filters
- Semantic dropout probabilities updated at end of each epoch
- Dropout applied only to working memory
- Long-term memory provides better generalization across tasks
- Table S2 provides performance of both working memory and long-term memory
- Table S3 shows effect of different sets of hyperparameters
- SCoMMER is not highly sensitive to a particular set of hyperparameters
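The dropout bullets above (per-class retention, random initialization of the heterogeneous probabilities, application to the working memory only) can be sketched as follows. The normalization by the most active unit, the fallback when no counts exist, and all function names are assumptions; the summary states only that retention is proportional to per-class activation counts.

```python
import random


def semantic_retention_probs(class_counts):
    """Retention probability per unit for one class.

    class_counts[j]: how often unit j was active for this class.
    Assumed rule: proportional to the count, normalized by the largest count.
    """
    peak = max(class_counts)
    if peak == 0:
        return [1.0] * len(class_counts)  # no statistics yet: keep every unit
    return [count / peak for count in class_counts]


def apply_dropout(activations, retention_probs, rng=random):
    """Keep each unit with its retention probability, otherwise zero it.

    Intended for the working memory only, as the bullets above state.
    """
    return [a if rng.random() < p else 0.0
            for a, p in zip(activations, retention_probs)]


def init_heterogeneous_probs(num_units, rng=random):
    """Random initial retention probabilities, so that the first task
    does not use all units (assumed uniform distribution)."""
    return [rng.random() for _ in range(num_units)]


probs = semantic_retention_probs([4, 2, 0])
dropped = apply_dropout([1.0, 2.0, 3.0], [1.0, 1.0, 0.0], rng=random.Random(0))
```

In a training loop, the per-class counts would be accumulated during an epoch and the probabilities recomputed at its end, in line with the update schedule noted above.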