Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Traditional CNN models are trained and tested on low resolution images.
  • Patch Gradient Descent (PatchGD) allows existing CNN architectures to be trained on large-scale images.
  • PatchGD updates the model on small parts of the image at a time.
  • PatchGD is evaluated on two datasets with ResNet50 and MobileNetV2 models.
  • PatchGD is more stable and efficient than standard gradient-descent when handling large images.

Paper Content

Introduction

  • Convolutional neural networks (CNNs) are important for computer vision
  • CNNs can extract complex information beyond standard computer vision methods
  • Recent technological developments have led to very large images from data acquisition
  • Deep learning methods have been proposed to handle microscopy images
  • CNNs are predominantly trained and tested on low resolution images
  • High content nanoscopy involves taking nanoscopy images of several adjacent fields-of-view
  • There is information at multiple scales embedded in these microscopy images
  • Existing deep learning models using CNNs have difficulty with large images
  • Common approach is to reduce resolution of images, but this can lead to loss of information
  • Alternate strategy is to divide image into tiles, but this does not preserve semantic link
  • Very limited works address issue of handling very large images using CNNs
  • Paper presents novel CNN training pipeline capable of handling very large images
  • Patch Gradient Descent (PatchGD) is scalable on small GPUs
  • PatchGD reinvents existing CNN training pipeline and is compatible with any existing CNN architecture
  • Aim is to improve capability of CNNs in handling large-scale images
  • Limited research in this direction
  • Most works focus on histopathological datasets
  • Pixel-level segmentation masks not always available
  • Existing methods employ patch-level classification, feature extraction, or build a compressed latent representation
  • Pretrained models used as feature extractors
  • Proposed single step approach that can be trained in an end-to-end manner
  • Identifying right patches from large images and using them in a compute-effective manner to classify the whole image
  • Transformer-based methods for vision-based tasks
  • Proposed self-supervised learning objective for pre-training large-scale vision transformer
  • PatchGD method proposed - model updates on small parts of the image, ensuring majority of it is covered over iterations
  • Small subnetwork added to build end-to-end CNN model

Mathematical formulation

  • Present a detailed mathematical formulation of the proposed PatchGD approach
  • Tailor the discussion towards training of a CNN model for the task of classification
  • Solve an optimization problem to train the network
  • Standard gradient descent involves model updates using gradients computed for the loss function over one or more image samples
  • Mini-batch gradient descent uses average of gradients computed over a batch of samples
  • PatchGD computes gradients using only part of the image and updates the model parameters
  • Model update step of PatchGD involves multiple inner iterations
  • PatchGD uses an additional sub-network that looks at the full latent encoding Z
  • Algorithm 1 describes model training over one batch of B images
  • Algorithm 2 describes filling of the Z block

Experiments

  • PatchGD is effective in numerical experiments
  • Experiments were conducted on two benchmark datasets with large images and features at multiple scales

Experimental setup

  • Two datasets used: UltraMNIST and PANDA
  • Two CNN architectures used: ResNet50 and MobileNetV2
  • Hyperparameters and implementation details in Appendix C
  • Evaluation metrics: classification accuracy and QWK
  • Framework: PyTorch
  • Memory constraints: 4GB, 16GB and 24GB
  • Latency calculated on 40GB A100 GPU

Results

  • PatchGD outperforms standard gradient descent method by large margins
  • PatchGD works well with lightweight models such as MobileNetV2
  • For higher resolution images, GD faces bottleneck of batch size and only 1 image per batch is permitted
  • PatchGD performs better than GD in terms of accuracy and QWK when handling large images
  • PatchGD performs almost at par with GD in terms of latency
  • Keeping sampling fraction per inner-iteration small helps to achieve better accuracy
  • Gradient accumulation length parameter for PatchGD should be chosen based on the number of inner steps

Discussion

  • Hyperparameter optimization and fine-tuning
  • Application to other tasks
  • Limitations
  • Conclusions
  • Future work includes scaling to gigapixel images, enhanced receptive field, and PatchGD with Transformers
  • Prostate cANcer graDe Assessment Challenge dataset used
  • MNIST digits dataset used

C. training methodology

  • Trained models for 100 epochs with Adam optimizer and learning rate warm-up
  • Latent classification head consists of 4 convolutional layers with 256 channels
  • Used gradient accumulation for better convergence
  • Baselines extended with similar head
  • Original backbone architecture trained separately for 100 epochs
  • Gradient accumulation over images for baseline in PANDA at 2048 resolution

D. results: additional details

  • Patch Gradient Descent (PatchGD) is a novel CNN training strategy that can train networks with high-resolution images
  • Performance comparison of standard CNN and PatchGD for UltraMNIST classification with images of size 512 ร— 512 using ResNet50 and MobileNetV2
  • Performance scores for standard Gradient Descent and PatchGD on PANDA dataset
  • Sampling ablation on PANDA dataset
  • Influence of different number of gradient accumulation steps on performance of MobileNetV2 for UltraMNIST classification