Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • State-of-the-art approaches for weather and climate modeling are based on physics-informed numerical models.
  • Data-driven approaches based on machine learning aim to directly solve a downstream forecasting or projection task.
  • These networks are trained using curated and homogeneous climate datasets.
  • ClimaX is a flexible and generalizable deep learning model for weather and climate science.
  • ClimaX is pre-trained with a self-supervised learning objective on climate datasets.
  • ClimaX can be fine-tuned to address a breadth of climate and weather tasks.
  • ClimaX results in superior performance on benchmarks for weather forecasting and climate projections.

Paper Content

Introduction

  • Modeling weather and climate is a challenge for science and society.
  • Numerical methods for global modeling of weather and climate are parameterized via general circulation models (GCMs).
  • GCMs have limitations in simulating atmospheric variables quickly at short time scales or accurately at long time scales.
  • Data-driven approaches for forecasting of atmospheric variables have been rising.
  • Deep neural networks are trained to predict target atmospheric variables using historical global datasets.
  • ML models trained to solve a single task using supervised learning are label-hungry and brittle when deployed outside their training distribution.
  • Pretraining large unsupervised “foundation” models on huge passive datasets can mitigate the supervision bottleneck.
  • ClimaX is proposed as a foundation model for weather and climate.
  • ClimaX is pretrained on a large dataset using an unsupervised objective.
  • ClimaX uses a vision transformer and a randomized forecasting objective.
  • ClimaX is benchmarked and is state-of-the-art on ClimateBench.
  • ClimaX can scale using heterogeneous climate datasets during pretraining.
  • Current weather and climate models rely on numerical methods and computational simulations.
  • Primitive equations are at the core of both weather and climate models.
  • Earth system models (ESM) are used for climate modeling.
  • Numerical Weather Prediction (NWP) models share components with GCMs.

Data sources

  • Weather and climate data is not solely based on sensed data, but incorporates information from a range of sources.
  • Data measurements are heterogeneous, representing various physical variables with different data types.
  • Data sources span multiple axes, from direct weather measurements to physics-informed climate projections.

ERA5

  • ERA5 reanalysis archive is the main data source for weather forecasting systems
  • ERA5 reanalysis is a detailed record of global atmosphere, land surface and ocean waves from 1950 onwards
  • ERA5 reanalysis combines a forecasting model with available observations
  • ERA5 reanalysis data is huge: 40 years, 0.25° × 0.25° grid, hourly intervals, 37 altitude levels
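
A rough back-of-the-envelope count illustrates the scale of these figures. This is only a sketch: it assumes a regular grid that includes both poles (721 × 1440 points), float32 storage, and an average year of 365.25 days, none of which are stated above.

```python
# Rough size estimate for a single ERA5 pressure-level variable.
# Assumptions (not from the summary): poles included, 4-byte float32 values.
years = 40
hours_per_year = 365.25 * 24          # average year length in hours
n_lat = int(180 / 0.25) + 1           # 721 latitude rows, poles included
n_lon = int(360 / 0.25)               # 1440 longitude columns
n_levels = 37

n_timesteps = int(years * hours_per_year)
values_per_variable = n_timesteps * n_lat * n_lon * n_levels
bytes_per_variable = values_per_variable * 4

print(f"{values_per_variable:.3e} values per variable")
print(f"about {bytes_per_variable / 1e12:.0f} TB per variable at float32")
```

Under these assumptions a single variable already runs to tens of terabytes, which is one reason work in this area often uses regridded, lower-resolution versions of the data.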

Tasks

  • Machine learning is being used for weather and climate modeling tasks.
  • Global forecasting tasks range from a few hours to days and weeks.
  • Evaluation is done on the ERA5 reanalysis dataset, with the operational IFS of ECMWF being the current state-of-the-art NWP baseline.
  • ClimateBench is a benchmark dataset providing an evaluation framework for machine learning models to improve accuracy of climate projections.

Foundation models

  • Bommasani et al. introduced the term “foundation models” to refer to deep learning models trained on broad data via self-supervision
  • Examples of foundation models include BERT, GPT and PaLM
  • Foundation models have been applied to data from web and scientific domains like protein design
  • Key significance of foundation models is emergence of model capabilities and homogenization of methodologies for different tasks, domains and modalities
  • Current research in weather and climate science and ML has focused on designing separate models for every task, but recent works have proposed pretraining techniques for satellite imagery and remote sensing

Approach

  • Aim to build a generalizable deep learning foundation model
  • Model needs to be able to input heterogeneous datasets of different variables
  • Model needs to provide spatio-temporal coverage based on physical groundings

Input representation

  • Model takes an input of shape V × H × W and predicts an output of shape V′ × H′ × W′
  • V refers to the number of input variables, such as weather conditions and climate forcing factors
  • H and W refer to the spatial resolution of the input data
  • V′, H′, W′ refer to the variables and spatial resolution of the predicted outputs
  • Mainly work with two spatial resolutions: 5.625° (32 × 64 grid points) and 1.40625° (128 × 256 grid points)
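
The grid sizes quoted above follow directly from dividing 180° of latitude and 360° of longitude by the grid spacing. The short sketch below checks that arithmetic; the helper name `grid_shape` is made up for illustration, and the variable count of 48 is the one used in the forecasting experiments.

```python
def grid_shape(resolution_deg):
    """Return (H, W) for a regular latitude-longitude grid at the given spacing."""
    h = round(180 / resolution_deg)   # latitude rows
    w = round(360 / resolution_deg)   # longitude columns
    return h, w

V = 48  # number of input variables in the forecasting experiments
for res in (5.625, 1.40625):
    H, W = grid_shape(res)
    print(f"{res} deg -> input shape {V} x {H} x {W}")
```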

Model architecture

  • Aim to design a foundation model that can be pretrained on heterogeneous data sources and finetuned to solve various downstream weather and climate tasks.
  • Tasks can be thought of as image-to-image translation problems with input and output channels.
  • Image architectures such as UNet, ResNet, and Vision Transformers (ViT) are natural fits.
  • Climate and weather tasks require more flexibility than current CNN-based architectures can provide.
  • Build ClimaX architecture upon Vision Transformers (ViT).
  • Propose two major architectural changes: variable tokenization and variable aggregation.
  • Variable tokenization tokenizes each variable in the input separately.
  • Variable aggregation performs a cross-attention operation for each spatial position.
  • Use a standard Vision Transformer (ViT) for generating output tokens.
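
The two architectural changes can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a patch size of 2, an embedding width of 16, random weights, a single attention head, and a single learnable query vector for the aggregation step. The summary above specifies none of these.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, W = 48, 32, 64      # variables, latitude rows, longitude columns
p, D = 2, 16              # patch size and embedding width (illustrative values)

def patchify(x, p):
    """Split (V, H, W) into (V, N, p*p) non-overlapping patches, N = (H/p)*(W/p)."""
    V, H, W = x.shape
    x = x.reshape(V, H // p, p, W // p, p).transpose(0, 1, 3, 2, 4)
    return x.reshape(V, (H // p) * (W // p), p * p)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Variable tokenization: every variable gets its own linear patch embedding,
# plus a variable embedding that marks which variable a token came from.
x = rng.normal(size=(V, H, W))
W_embed = rng.normal(size=(V, p * p, D)) * 0.1
var_embed = rng.normal(size=(V, 1, D)) * 0.1
tokens = np.einsum("vnk,vkd->vnd", patchify(x, p), W_embed) + var_embed  # (V, N, D)

# Variable aggregation: at each spatial position, a query attends over the
# V variable tokens and reduces them to one token (single-head cross-attention).
query = rng.normal(size=(D,)) * 0.1
W_k = rng.normal(size=(D, D)) * 0.1
W_v = rng.normal(size=(D, D)) * 0.1
keys, values = tokens @ W_k, tokens @ W_v              # (V, N, D) each
scores = np.einsum("d,vnd->vn", query, keys) / np.sqrt(D)
attn = softmax(scores, axis=0)                         # weights over variables
aggregated = np.einsum("vn,vnd->nd", attn, values)     # (N, D)

print(tokens.shape, aggregated.shape)
```

After aggregation the sequence length no longer depends on the number of input variables, so a standard ViT backbone can process the result regardless of which variables a dataset provides.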

Datasets

Pretraining

  • ClimaX is pretrained to predict future weather conditions given current conditions.
  • During pretraining, the lead time (how far into the future the model is predicting) is randomized from 6 hours to 168 hours (1 week).
  • The lead time is added to the tokens to inform the model of how long it is forecasting.
  • The ERA5 reanalysis data is used for finetuning and evaluation for various weather related downstream tasks.
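
The randomized forecasting objective amounts to drawing a fresh lead time for every training example. The sketch below shows one way to build such input/target index pairs. It assumes snapshots spaced 6 hours apart and lead times drawn uniformly in multiples of that spacing; the function name and both assumptions are illustrative rather than taken from the paper.

```python
import random

MIN_LEAD_H, MAX_LEAD_H = 6, 168   # lead-time range from the text, in hours

def sample_pretraining_pair(num_steps, step_hours=6, rng=random):
    """Pick an input index t and a target index t + lead for one training example.

    Assumes snapshots are `step_hours` apart and that lead times are drawn
    uniformly in multiples of that spacing (an assumption of this sketch).
    """
    min_lead_steps = max(1, MIN_LEAD_H // step_hours)
    max_lead_steps = MAX_LEAD_H // step_hours
    lead_steps = rng.randint(min_lead_steps, max_lead_steps)   # inclusive bounds
    t = rng.randrange(0, num_steps - lead_steps)               # keep target in range
    return t, t + lead_steps, lead_steps * step_hours

random.seed(0)
t_in, t_out, lead_hours = sample_pretraining_pair(num_steps=1000)
print(t_in, t_out, lead_hours)
```

The sampled `lead_hours` value is what would be embedded and added to the tokens so that the model knows how far ahead it is asked to predict.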

Finetuning

  • ClimaX has four learnable components
  • Evaluate performance of ClimaX on various downstream tasks
  • Downstream tasks categorized into two finetuning scenarios
  • First scenario: finetune entire model
  • Second scenario: replace embedding layers and prediction head with newly initialized networks, finetune or freeze other two components

Experiments

  • Evaluate performance and generality of ClimaX on downstream tasks
  • Compare ClimaX performance to current state-of-the-art NWP system
  • Analyze scaling property of ClimaX with increasing data size, model capacity, and data resolution
  • Perform ablation studies to understand trade-off between computation and performance

Neural baselines

  • Compared ClimaX with IFS, the current gold standard in weather forecasting
  • Compared with UNet and ResNet, two CNN baselines commonly used in vision tasks
  • Borrowed ResNet architecture from Weatherbench

Forecasting

  • We want to forecast the weather at a future time given global weather conditions.
  • There are 48 input variables in total.
  • We are predicting four target variables: geopotential at 500hPa, temperature at 850hPa, temperature at 2 meters from the ground, and zonal wind speed at 10 meters from the ground.
  • We are considering seven lead times: 6 hours, 1, 3, 5, 7 days, 2 weeks, and 1 month.
  • We are comparing ClimaX with IFS and two CNN baselines on the ERA5 dataset at 5.625° and 1.40625° resolutions.
  • We are using latitude-weighted MSE loss and evaluating the best checkpoint on the test set.
  • We are comparing all methods on latitude-weighted root mean squared error (RMSE) and latitude-weighted anomaly correlation coefficient (ACC).
  • We are evaluating ClimaX on regional forecasting of relevant variables in North America.
  • We are comparing ClimaX with two CNN baselines and the scratch-trained version of ClimaX.
  • We are evaluating ClimaX on subseasonal-to-seasonal (S2S) prediction of four target variables: T850, T2m, U10, and V10.
  • ClimaX achieves the lowest error for all variables.
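
The latitude weighting used in the loss and metrics compensates for grid cells shrinking toward the poles on an equiangular grid. The sketch below shows a common formulation, with cosine-of-latitude weights normalized to mean 1. It is assumed here rather than quoted from the paper, and the function names and synthetic data are illustrative.

```python
import numpy as np

def latitude_weights(lats_deg):
    """cos(latitude) weights normalized to mean 1, shape (H,)."""
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def lat_weighted_rmse(pred, target, lats_deg):
    """pred, target: (T, H, W). RMSE per time step, averaged over time."""
    w = latitude_weights(lats_deg)[None, :, None]
    per_step = np.sqrt((w * (pred - target) ** 2).mean(axis=(1, 2)))
    return float(per_step.mean())

def lat_weighted_acc(pred, target, climatology, lats_deg):
    """Anomaly correlation relative to a climatology of shape (H, W)."""
    w = latitude_weights(lats_deg)[None, :, None]
    p = pred - climatology
    t = target - climatology
    p = p - p.mean()
    t = t - t.mean()
    return float((w * p * t).sum() / np.sqrt((w * p**2).sum() * (w * t**2).sum()))

# Synthetic example on a 5.625 degree grid (32 x 64), using cell-center latitudes.
lats = np.linspace(-87.1875, 87.1875, 32)
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 32, 64))
pred = target + 0.1 * rng.normal(size=target.shape)
clim = np.zeros((32, 64))   # placeholder climatology

print(lat_weighted_rmse(pred, target, lats))
print(lat_weighted_acc(pred, target, clim, lats))
```

Lower RMSE and higher ACC are better; a perfect forecast gives an RMSE of 0 and an ACC of 1.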

Climate projection

  • ClimaX is tested on ClimateBench, a benchmark designed for testing machine learning models for climate projections
  • Goal is to predict annual mean global distributions of surface temperature, diurnal temperature range, precipitation, and the 90th percentile of precipitation
  • Input variables are four anthropogenic forcing factors: carbon dioxide (CO2), sulfur dioxide (SO2), black carbon (BC), and methane (CH4)
  • Finetuning pipeline of ClimaX for ClimateBench includes replacing pretrained embedding layers and prediction heads with newly initialized networks, while keeping the attention layers and the variable aggregation module
  • Two finetuning protocols are considered: freezing or finetuning the attention layers
  • Two components are added to the pipeline: a history of the preceding ten years of the forcing factors and a global average pooling layer
  • Mean squared error is used as the loss function
  • Results show that ClimaX with frozen attention layers (ClimaX-frozen) performs best in predicting the two temperature-related variables, followed by the finetuned ClimaX

Climate model downscaling

  • Climate models are often run at coarse grids due to their high computational cost.
  • Downscaling aims to obtain higher-resolution projections and reduce biases from the outputs of these models.
  • ClimaX achieves the lowest RMSE and a mean bias closest to 0 for all three target variables.
  • ClimaX has successfully captured the spatial structure of weather data.
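
RMSE and mean bias measure different things, which is why both are reported. The sketch below uses unweighted versions of the two metrics on made-up 4 × 4 fields to show that a prediction can have zero mean bias and still a large RMSE. The paper's exact metric definitions may differ, for example by applying latitude weighting.

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error over all grid points."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def mean_bias(pred, target):
    """Average signed error; positive means the prediction runs high on average."""
    return float(np.mean(pred - target))

target = np.zeros((4, 4))
# Hypothetical prediction A: +1 / -1 errors in a checkerboard that cancel on average.
pred_a = np.indices((4, 4)).sum(axis=0) % 2 * 2.0 - 1.0
# Hypothetical prediction B: a small uniform offset.
pred_b = target + 0.5

print(rmse(pred_a, target), mean_bias(pred_a, target))
print(rmse(pred_b, target), mean_bias(pred_b, target))
```

Prediction A is unbiased but inaccurate point by point, while prediction B is closer everywhere yet systematically offset; a good downscaling model needs both a low RMSE and a mean bias near 0.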

Scaling laws analysis

  • Transformers have favorable scaling properties
  • Performance improves with data size and model capacity
  • Figure 11 shows performance of ClimaX as a function of data size and model capacity
  • Error rate of two biggest models decreases with increasing data and model size
  • Larger models are more data efficient
  • High-resolution data contains finer details and local processes
  • Figure 12 compares performance of ClimaX on 5.625° and 1.40625° data
  • ClimaX-all-vars achieves comparable performance to ClimaX
  • ClimaX-iter works reasonably well up to 1-day prediction
  • ClimaX-cont performs competitively on 6-hour to 7-day forecasting
  • Finetuning cost scales linearly with number of target variables and lead times
  • ClimaX is most expensive, ClimaX-iter is cheapest
  • Performance is proportional to computational cost
  • Scaling has transformative impact in AI subdisciplines
  • ClimaX opens up new opportunities for scaling
  • Future research could explore incorporating observational and simulated datasets
  • Future research could explore single multi-scale architectures