Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Traditional CNN models are trained and tested on low resolution images.
- Patch Gradient Descent (PatchGD) allows existing CNN architectures to be trained on large-scale images.
- PatchGD updates the model on small parts of the image at a time.
- PatchGD is evaluated on two datasets with ResNet50 and MobileNetV2 models.
- PatchGD is more stable and efficient than standard gradient-descent when handling large images.
Paper Content
Introduction
- Convolutional neural networks (CNNs) are important for computer vision
- CNNs can extract complex information beyond standard computer vision methods
- Recent technological developments have led to very large images from data acquisition
- Deep learning methods have been proposed to handle microscopy images
- CNNs are predominantly trained and tested on low resolution images
- High content nanoscopy involves taking nanoscopy images of several adjacent fields-of-view
- There is information at multiple scales embedded in these microscopy images
- Existing deep learning models using CNNs have difficulty with large images
- Common approach is to reduce resolution of images, but this can lead to loss of information
- Alternate strategy is to divide image into tiles, but this does not preserve semantic link
- Very limited works address issue of handling very large images using CNNs
- Paper presents novel CNN training pipeline capable of handling very large images
- Patch Gradient Descent (PatchGD) is scalable on small GPUs
- PatchGD reinvents existing CNN training pipeline and is compatible with any existing CNN architecture
Related work
- Aim is to improve capability of CNNs in handling large-scale images
- Limited research in this direction
- Most works focus on histopathological datasets
- Pixel-level segmentation masks not always available
- Existing methods employ patch-level classification, feature extraction, or build a compressed latent representation
- Pretrained models used as feature extractors
- Proposed single step approach that can be trained in an end-to-end manner
- Identifying right patches from large images and using them in a compute-effective manner to classify the whole image
- Transformer-based methods for vision-based tasks
- Proposed self-supervised learning objective for pre-training large-scale vision transformer
- PatchGD method proposed - model updates on small parts of the image, ensuring majority of it is covered over iterations
- Small subnetwork added to build end-to-end CNN model
Mathematical formulation
- Present a detailed mathematical formulation of the proposed PatchGD approach
- Tailor the discussion towards training of a CNN model for the task of classification
- Solve an optimization problem to train the network
- Standard gradient descent involves model updates using gradients computed for the loss function over one or more image samples
- Mini-batch gradient descent uses average of gradients computed over a batch of samples
- PatchGD computes gradients using only part of the image and updates the model parameters
- Model update step of PatchGD involves multiple inner iterations
- PatchGD uses an additional sub-network that looks at the full latent encoding Z
- Algorithm 1 describes model training over one batch of B images
- Algorithm 2 describes filling of the Z block
Experiments
- PatchGD is effective in numerical experiments
- Experiments were conducted on two benchmark datasets with large images and features at multiple scales
Experimental setup
- Two datasets used: UltraMNIST and PANDA
- Two CNN architectures used: ResNet50 and MobileNetV2
- Hyperparameters and implementation details in Appendix C
- Evaluation metrics: classification accuracy and QWK
- Framework: PyTorch
- Memory constraints: 4GB, 16GB and 24GB
- Latency calculated on 40GB A100 GPU
Results
- PatchGD outperforms standard gradient descent method by large margins
- PatchGD works well with lightweight models such as MobileNetV2
- For higher resolution images, GD faces bottleneck of batch size and only 1 image per batch is permitted
- PatchGD performs better than GD in terms of accuracy and QWK when handling large images
- PatchGD performs almost at par with GD in terms of latency
- Keeping sampling fraction per inner-iteration small helps to achieve better accuracy
- Gradient accumulation length parameter for PatchGD should be chosen based on the number of inner steps
Discussion
- Hyperparameter optimization and fine-tuning
- Application to other tasks
- Limitations
- Conclusions
- Future work includes scaling to gigapixel images, enhanced receptive field, and PatchGD with Transformers
- Prostate cANcer graDe Assessment Challenge dataset used
- MNIST digits dataset used
C. training methodology
- Trained models for 100 epochs with Adam optimizer and learning rate warm-up
- Latent classification head consists of 4 convolutional layers with 256 channels
- Used gradient accumulation for better convergence
- Baselines extended with similar head
- Original backbone architecture trained separately for 100 epochs
- Gradient accumulation over images for baseline in PANDA at 2048 resolution
D. results: additional details
- Patch Gradient Descent (PatchGD) is a novel CNN training strategy that can train networks with high-resolution images
- Performance comparison of standard CNN and PatchGD for UltraMNIST classification with images of size 512 ร 512 using ResNet50 and MobileNetV2
- Performance scores for standard Gradient Descent and PatchGD on PANDA dataset
- Sampling ablation on PANDA dataset
- Influence of different number of gradient accumulation steps on performance of MobileNetV2 for UltraMNIST classification