7.4 Data Assimilation
Data assimilation is the mathematical discipline of optimally combining imperfect model forecasts (the background) with noisy, incomplete, irregularly distributed observations to produce the best possible estimate of the atmospheric state -- the analysis. This analysis serves as the initial condition for the next forecast. Data assimilation is widely considered the single most important component of the NWP system; the skill of modern forecasts owes as much to assimilation advances as to model improvements. Modern operational systems assimilate tens of millions of observations every 6-12 hours from radiosondes, satellites, aircraft, radar, buoys, and GPS receivers.
Bayesian Framework
Data assimilation is fundamentally a Bayesian estimation problem. Given a prior (the background forecast) and new data (observations), we seek the posterior (the analysis):
$$p(\mathbf{x}|\mathbf{y}) \propto p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})$$
Posterior $\propto$ Likelihood $\times$ Prior. For Gaussian errors, the MAP estimate minimizes a quadratic cost function.
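The scalar Gaussian case makes this concrete. A minimal sketch (illustrative numbers, not from any operational system): the MAP estimate is a precision-weighted average of background and observation, and a brute-force minimization of the quadratic cost recovers the same value.

```python
import numpy as np

# Scalar Gaussian example (illustrative numbers): background with 2 K error,
# observation with 1 K error. Posterior precision = prior + likelihood precision.
xb, sig_b = 280.0, 2.0      # background temperature and its error std dev
yo, sig_o = 283.0, 1.0      # observation and its error std dev

# MAP estimate = minimum of J(x) = (x-xb)^2/(2 sig_b^2) + (yo-x)^2/(2 sig_o^2)
w = sig_b**2 / (sig_b**2 + sig_o**2)                   # weight on the observation
xa = xb + w * (yo - xb)                                # precision-weighted average
sig_a = np.sqrt(1.0 / (1.0/sig_b**2 + 1.0/sig_o**2))   # analysis std dev

# Verify against brute-force minimization of the quadratic cost
xs = np.linspace(275.0, 290.0, 100001)
J = 0.5*(xs - xb)**2/sig_b**2 + 0.5*(yo - xs)**2/sig_o**2
xa_grid = xs[np.argmin(J)]
```

The analysis uncertainty `sig_a` is smaller than either input uncertainty: combining two independent Gaussian estimates always sharpens the posterior.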
Background Error Covariance B
$\mathbf{B} = \overline{(\mathbf{x}^b - \mathbf{x}^t)(\mathbf{x}^b - \mathbf{x}^t)^T}$. Dimension $\sim 10^8 \times 10^8$ -- never stored explicitly. Encodes error variances and spatial correlations, and thereby determines how far an observation's influence spreads. Estimated via the NMC method or from ensembles.
Observation Error Covariance R
$\mathbf{R} = \overline{(\mathbf{y}^o - H(\mathbf{x}^t))(\mathbf{y}^o - H(\mathbf{x}^t))^T}$. Includes instrument, representativeness, and forward model errors. Often assumed diagonal, though satellite inter-channel correlations are increasingly accounted for.
Optimal Interpolation (OI)
OI (Gandin 1963) was the operational standard from the 1970s-1990s. It computes the analysis using observations within a local influence region:
$$\mathbf{x}^a = \mathbf{x}^b + \mathbf{K}(\mathbf{y}^o - H(\mathbf{x}^b))$$
$$\mathbf{K} = \mathbf{BH}^T(\mathbf{HBH}^T + \mathbf{R})^{-1}$$
Analysis = Background + Kalman Gain $\times$ Innovation. The gain K minimizes analysis error variance: $\mathbf{A} = (\mathbf{I} - \mathbf{KH})\mathbf{B}$.
OI is conceptually simple and computationally inexpensive (only local matrix inversions, typically 30-50 nearest observations per analysis point), but B is static and prescribed. Superseded by variational and ensemble methods for operational use.
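A minimal NumPy sketch of the OI/Kalman update above, for a 3-point state with 2 direct observations (all numbers illustrative):

```python
import numpy as np

# Toy OI update: 3 grid points, observations at points 0 and 2.
B = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])              # background error covariance
R = 0.25 * np.eye(2)                         # observation error covariance
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])              # observation operator

xb = np.array([280.0, 281.0, 282.0])         # background state
yo = np.array([281.0, 281.5])                # observations

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R) # Kalman gain
xa = xb + K @ (yo - H @ xb)                  # analysis
A = (np.eye(3) - K @ H) @ B                  # analysis error covariance
```

Note that the unobserved middle point is corrected too: the off-diagonal elements of B spread the innovations spatially, which is exactly the role the background error covariance plays in operational OI.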
3D-Var: Three-Dimensional Variational Assimilation
3D-Var finds the analysis by minimizing a cost function measuring weighted distance from both background and observations:
$$J(\mathbf{x}) = \frac{1}{2}(\mathbf{x} - \mathbf{x}^b)^T \mathbf{B}^{-1}(\mathbf{x} - \mathbf{x}^b) + \frac{1}{2}(\mathbf{y}^o - H(\mathbf{x}))^T \mathbf{R}^{-1}(\mathbf{y}^o - H(\mathbf{x}))$$
$J_b$ (background penalty) + $J_o$ (observation penalty). Minimum satisfies $\nabla J = 0$.
The gradient: $\nabla J = \mathbf{B}^{-1}(\mathbf{x} - \mathbf{x}^b) - \mathbf{H}^T\mathbf{R}^{-1}(\mathbf{y}^o - H(\mathbf{x}))$ where $\mathbf{H} = \partial H/\partial \mathbf{x}$ is the Jacobian. Minimized iteratively using conjugate gradient or L-BFGS (30-100 iterations). Equivalent to OI when H is linear. B specified via NMC method or ensemble estimates.
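Because H is linear here, the cost function is quadratic and $\nabla J = 0$ reduces to a symmetric positive-definite linear system, which a plain conjugate-gradient loop solves; the result must match the closed-form OI analysis. A sketch with illustrative sizes and values:

```python
import numpy as np

# Toy 3D-Var: minimize the quadratic cost via conjugate gradients and
# compare with the closed-form OI/Kalman solution.
n, p = 6, 3
idx = np.arange(n)
B = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 2.0)   # correlated B
R = 0.1 * np.eye(p)
H = np.zeros((p, n)); H[0, 0] = H[1, 2] = H[2, 5] = 1.0  # observe 3 points

xb = np.linspace(280.0, 285.0, n)                        # background
yo = H @ xb + np.array([0.5, -0.3, 0.2])                 # synthetic obs

Binv, Rinv = np.linalg.inv(B), np.linalg.inv(R)

# grad J = 0  <=>  (B^-1 + H^T R^-1 H) dx = H^T R^-1 (yo - H xb)
A = Binv + H.T @ Rinv @ H
b = H.T @ Rinv @ (yo - H @ xb)

def cg(A, b, tol=1e-12, maxit=100):
    """Plain conjugate gradients for SPD A (the usual 3D-Var inner loop)."""
    x = np.zeros_like(b); r = b.copy(); pdir = r.copy(); rs = r @ r
    for _ in range(maxit):
        Ap = A @ pdir
        alpha = rs / (pdir @ Ap)
        x += alpha * pdir
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        pdir = r + (rs_new / rs) * pdir
        rs = rs_new
    return x

xa = xb + cg(A, b)

# Closed-form OI solution for comparison
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
xa_oi = xb + K @ (yo - H @ xb)
```

Operational systems never form $\mathbf{B}^{-1}$ explicitly; they minimize in a preconditioned control-variable space, but the quadratic structure is the same.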
4D-Var: Four-Dimensional Variational Assimilation
4D-Var extends 3D-Var over a time window, using the forecast model as a strong constraint:
$$J(\mathbf{x}_0) = \frac{1}{2}(\mathbf{x}_0 - \mathbf{x}^b)^T \mathbf{B}^{-1}(\mathbf{x}_0 - \mathbf{x}^b) + \frac{1}{2}\sum_{i=0}^{N}(\mathbf{y}_i - H_i(\mathbf{x}_i))^T \mathbf{R}_i^{-1}(\mathbf{y}_i - H_i(\mathbf{x}_i))$$
subject to $\mathbf{x}_{i+1} = M_{i \to i+1}(\mathbf{x}_i)$. Requires the adjoint model $\mathbf{M}^T$ integrated backward for gradient computation.
Tangent Linear Model
$\delta\mathbf{x}_i = \mathbf{M}_{0 \to i}\,\delta\mathbf{x}_0$. Propagates perturbations forward. Valid for small perturbations over limited times (6-12 hours). Requires linearizing all model processes, dynamics and physics alike.
Adjoint Model
Transpose $\mathbf{M}^T$ integrated backward. Propagates gradient information from observations back to initial time. Major software engineering effort. ECMWF 4D-Var uses 12-hour windows with incremental approach at reduced resolution (T159-T399).
4D-Var implicitly evolves B in time, providing flow-dependent error statistics. The incremental approach (Courtier et al. 1994): minimize for increment $\delta\mathbf{x} = \mathbf{x}^a - \mathbf{x}^b$ at lower resolution, then add to full-resolution background.
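A toy strong-constraint example, assuming a *linear* model M (a random matrix near the identity) so that the adjoint is simply $\mathbf{M}^T$; window length, observations, and error statistics are illustrative. The adjoint-computed gradient is checked against central finite differences, the standard sanity test for any adjoint code:

```python
import numpy as np

# Toy strong-constraint 4D-Var with a linear model (illustrative).
n, N = 4, 3                                   # state dim, window steps
rng = np.random.default_rng(1)
M = np.eye(n) + 0.05 * rng.standard_normal((n, n))   # one-step model
H = np.eye(n)                                 # observe the full state
Binv = np.eye(n)                              # background error: unit variance
Rinv = 25.0 * np.eye(n)                       # obs error std dev 0.2

xb = rng.standard_normal(n)                   # background at window start
ys = [rng.standard_normal(n) for _ in range(N + 1)]  # synthetic observations

def cost(x0):
    J = 0.5 * (x0 - xb) @ Binv @ (x0 - xb)
    x = x0.copy()
    for i in range(N + 1):
        d = ys[i] - H @ x
        J += 0.5 * d @ Rinv @ d
        x = M @ x                             # advance to the next obs time
    return J

def grad_adjoint(x0):
    # Forward sweep: store the trajectory
    xs = [x0]
    for _ in range(N):
        xs.append(M @ xs[-1])
    ds = [ys[i] - H @ xs[i] for i in range(N + 1)]
    # Backward sweep: propagate observation information with M^T
    lam = -H.T @ Rinv @ ds[N]
    for i in range(N - 1, -1, -1):
        lam = M.T @ lam - H.T @ Rinv @ ds[i]
    return Binv @ (x0 - xb) + lam

x0 = xb + 0.1
g = grad_adjoint(x0)
eps = 1e-6
g_fd = np.array([(cost(x0 + eps * e) - cost(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
```

For a nonlinear NWP model the backward sweep uses the adjoint of the tangent linear model linearized about the stored trajectory, which is what makes the adjoint such a large software effort.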
Ensemble Kalman Filter (EnKF)
The EnKF (Evensen 1994) estimates B from an ensemble of forecasts, making it flow-dependent without requiring an adjoint model:
$$\mathbf{P}^b \approx \frac{1}{K-1}\sum_{k=1}^{K}(\mathbf{x}_k^b - \bar{\mathbf{x}}^b)(\mathbf{x}_k^b - \bar{\mathbf{x}}^b)^T$$
$$\mathbf{x}_k^a = \mathbf{x}_k^b + \mathbf{K}(\mathbf{y}_k^o - H(\mathbf{x}_k^b)), \quad \mathbf{K} = \mathbf{P}^b\mathbf{H}^T(\mathbf{HP}^b\mathbf{H}^T + \mathbf{R})^{-1}$$
Sample covariance from K members (typically 20-80). Perturbed observations: $\mathbf{y}_k^o = \mathbf{y}^o + \boldsymbol{\epsilon}_k$, $\boldsymbol{\epsilon}_k \sim N(0,\mathbf{R})$
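A stochastic (perturbed-observation) EnKF analysis step can be sketched as follows; state size, ensemble size, and error statistics are illustrative:

```python
import numpy as np

# Stochastic EnKF update for a small toy state.
rng = np.random.default_rng(2)
n, p, K = 10, 4, 40                      # state dim, obs dim, ensemble size

H = np.zeros((p, n))
H[np.arange(p), [0, 3, 6, 9]] = 1.0      # observe 4 of the 10 points
R = 0.25 * np.eye(p)

xt = np.sin(np.linspace(0, np.pi, n))    # synthetic "truth"
Xb = xt[:, None] + rng.standard_normal((n, K))   # background ensemble
yo = H @ xt + 0.5 * rng.standard_normal(p)       # noisy observations

xbar = Xb.mean(axis=1, keepdims=True)
Xp = Xb - xbar                                   # ensemble perturbations
Pb = Xp @ Xp.T / (K - 1)                         # sample covariance

Kg = Pb @ H.T @ np.linalg.inv(H @ Pb @ H.T + R)  # Kalman gain
# Perturbed observations: each member assimilates yo + eps_k, eps_k ~ N(0, R)
Yo = yo[:, None] + 0.5 * rng.standard_normal((p, K))
Xa = Xb + Kg @ (Yo - H @ Xb)

spread_b = Xb.std(axis=1, ddof=1).mean()
spread_a = Xa.std(axis=1, ddof=1).mean()
```

Perturbing the observations is what keeps the analysis ensemble spread statistically consistent with the Kalman-filter analysis covariance; without it the spread would be systematically too small.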
Localization (Gaspari-Cohn)
With K $\ll$ state dimension, sample covariance has spurious long-range correlations. Gaspari-Cohn (1999) function tapers covariances to zero beyond cutoff (~500-2000 km horizontal, ~1 scale height vertical). Applied via Schur (element-wise) product: $\mathbf{P}_{loc}^b = \boldsymbol{\rho} \circ \mathbf{P}^b$.
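The Gaspari-Cohn taper is a fifth-order piecewise-rational function with compact support at twice the half-width c. A sketch of the function and its Schur-product application on a 1D grid (grid and ensemble sizes illustrative):

```python
import numpy as np

# Gaspari-Cohn (1999) compactly supported correlation function.
def gaspari_cohn(d, c):
    """Taper value for separation d and localization half-width c."""
    r = np.abs(d) / c
    rho = np.zeros_like(r)                   # zero beyond r = 2
    m1 = r <= 1.0
    m2 = (r > 1.0) & (r <= 2.0)
    r1, r2 = r[m1], r[m2]
    rho[m1] = -0.25*r1**5 + 0.5*r1**4 + 0.625*r1**3 - 5.0/3.0*r1**2 + 1.0
    rho[m2] = (r2**5/12.0 - 0.5*r2**4 + 0.625*r2**3 + 5.0/3.0*r2**2
               - 5.0*r2 + 4.0 - 2.0/(3.0*r2))
    return rho

# Schur (element-wise) product with a small-ensemble sample covariance
x = np.arange(20, dtype=float)               # grid coordinates (arbitrary units)
dist = np.abs(x[:, None] - x[None, :])
rho = gaspari_cohn(dist, c=5.0)              # support out to 10 grid units

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 15))            # 15 members: spurious correlations
Pb = np.cov(X)
Pb_loc = rho * Pb                            # localized covariance
```

After localization, covariances between widely separated points are exactly zero, so a distant observation can no longer corrupt the analysis through sampling noise.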
Inflation
Spread collapse corrected via multiplicative inflation: $\mathbf{x}_k^b \leftarrow \bar{\mathbf{x}}^b + \alpha(\mathbf{x}_k^b - \bar{\mathbf{x}}^b)$ with $\alpha \approx 1.01$-1.10, or additive inflation (random perturbations from climatological B). Adaptive inflation (Anderson 2007) estimates $\alpha$ from innovation statistics.
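Multiplicative inflation is a one-line transformation of the ensemble: it rescales perturbations about the mean without moving the mean. A sketch with illustrative sizes:

```python
import numpy as np

# Multiplicative inflation applied to a background ensemble.
rng = np.random.default_rng(6)
Xb = 280.0 + rng.standard_normal((10, 30))   # state dim 10, 30 members
alpha = 1.05                                 # inflation factor (typical range)

xbar = Xb.mean(axis=1, keepdims=True)
Xb_infl = xbar + alpha * (Xb - xbar)         # mean unchanged, spread x alpha

s0 = Xb.std(axis=1, ddof=1)
s1 = Xb_infl.std(axis=1, ddof=1)
```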
Hybrid Ensemble-Variational Methods
$$\mathbf{B}_{hybrid} = \beta_1\mathbf{B}_{static} + \beta_2\mathbf{B}_{ensemble}$$
$\beta_1 + \beta_2 = 1$, typically $\beta_2 \approx 0.5$-0.8. Static B provides baseline; ensemble B provides flow-dependent features.
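The blend itself is a convex combination of two covariance matrices; a minimal sketch (sizes, weights, and covariances illustrative) also shows why the static term matters: it keeps the hybrid positive definite even where the low-rank ensemble estimate is deficient.

```python
import numpy as np

# Hybrid covariance: static climatological B plus an ensemble estimate.
n = 8
idx = np.arange(n)
B_static = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 3.0)  # climatological

rng = np.random.default_rng(4)
X = rng.standard_normal((n, 20))                 # 20-member toy ensemble
Xp = X - X.mean(axis=1, keepdims=True)
B_ens = Xp @ Xp.T / (20 - 1)                     # flow-dependent sample covariance

beta2 = 0.7                                      # typical weight on the ensemble part
beta1 = 1.0 - beta2
B_hybrid = beta1 * B_static + beta2 * B_ens
```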
Hybrid 3DEnVar (NCEP GFS)
3D-Var with B blended from static climatological covariance and GDAS EnKF ensemble (80 members). Extended control variables incorporate ensemble covariance in the variational minimization.
Hybrid 4DEnVar (ECMWF)
Ensemble provides temporal evolution of covariances across the window without requiring an adjoint. ECMWF uses 50-member EDA (Ensemble of Data Assimilations) to supplement static B in 4D-Var. State of the art.
Observation Types and Quality Control
Radiosondes
~900 stations globally, 2x daily. T, RH, wind profiles to ~30 hPa. Gold standard for calibration. Sparse over oceans and Southern Hemisphere. Observation operator: spatial interpolation.
Satellite Radiances
$H(\mathbf{x}) = \int B(\nu,T(z))\frac{\partial\tau}{\partial z}dz + \epsilon B(\nu,T_s)\tau_s$. AMSU, IASI, CrIS, ATMS. ~60% of total forecast impact. Requires radiative transfer model (RTTOV, CRTM) and bias correction (VarBC).
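A discretized sketch of this radiance observation operator: a weighting-function ($\partial\tau/\partial z$) weighted integral of the Planck function over a model temperature profile, plus a surface term. The profile, channel frequency, and transmittance curve below are toy assumptions, not output from RTTOV or CRTM.

```python
import numpy as np

# Toy clear-sky radiance forward operator on a discrete height grid.
h, c, kB = 6.626e-34, 2.998e8, 1.381e-23     # Planck, light speed, Boltzmann

def planck(nu, T):
    """Planck radiance at frequency nu [Hz] and temperature T [K]."""
    return 2*h*nu**3 / c**2 / np.expm1(h*nu / (kB*T))

nu = 6.0e13                                  # ~5 um infrared channel (assumed)
z = np.linspace(0.0, 20e3, 41)               # height grid [m]
T = 288.0 - 6.5e-3 * np.minimum(z, 11e3)     # simple lapse-rate profile
tau = np.exp(-2.0 * np.exp(-z / 8e3))        # toy level-to-space transmittance

eps_s, Ts = 0.98, 290.0                      # surface emissivity, skin temp
dtau = np.diff(tau)                          # layer weighting function
Tmid = 0.5 * (T[:-1] + T[1:])
radiance = eps_s * planck(nu, Ts) * tau[0] + np.sum(planck(nu, Tmid) * dtau)
```

The layers where `dtau` peaks are the levels the channel "sees", which is why each satellite channel constrains temperature over a broad vertical layer rather than at a point.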
GPS Radio Occultation
Refractivity: $N = 77.6\frac{p}{T} + 3.73 \times 10^5\frac{e}{T^2}$. No calibration needed, all-weather. COSMIC-2 provides 5000+ profiles/day. Disproportionately large per-observation impact.
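Plugging illustrative warm, moist boundary-layer values into the refractivity formula above:

```python
# Smith-Weintraub refractivity (illustrative values).
p = 1000.0      # hPa, total pressure
T = 290.0       # K
e = 15.0        # hPa, water vapour partial pressure

N_dry = 77.6 * p / T            # hydrostatic (dry) term
N_wet = 3.73e5 * e / T**2       # water vapour term
N = N_dry + N_wet               # ~334 N-units
```

The strong wet term is why radio occultation carries significant humidity information in the lower troposphere, while in the dry upper atmosphere it constrains temperature.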
Aircraft (AMDAR/ACARS)
~300,000 reports/day. Temperature and wind at cruise altitude, profiles during ascent/descent. High temporal resolution along flight routes. 15-20% of total forecast impact in Northern Hemisphere.
Quality Control
Background and Buddy Checks
Background check: reject if $|y^o - H(x^b)| > n\sqrt{\sigma_b^2 + \sigma_o^2}$ (n=3-5). Buddy check: compare each observation against nearby observations; reject spatial outliers. Innovation should be drawn from $N(0, \mathbf{HBH}^T + \mathbf{R})$.
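The background check is a one-line vectorized test; the numbers below are illustrative, with one deliberately gross observation error:

```python
import numpy as np

# Background ("first-guess") check: reject observations whose innovation
# exceeds n standard deviations of the expected innovation spread.
sig_b, sig_o, nsig = 1.5, 1.0, 4.0
thresh = nsig * np.sqrt(sig_b**2 + sig_o**2)

Hxb = np.array([280.0, 281.0, 279.5, 282.0])   # background at obs locations
yo  = np.array([280.8, 281.5, 291.0, 281.2])   # third value is a gross error

innov = yo - Hxb
keep = np.abs(innov) < thresh                   # reject the 11.5 K innovation
```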
Variational QC and Bias Correction
VarQC: Huber norm combines $L_2$ for small innovations with $L_1$ for outliers, providing robustness without hard rejection. VarBC (Dee 2005): estimates satellite bias predictors (scan position, airmass, skin temperature) within the minimization itself.
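The Huber penalty is quadratic for small normalized innovations and linear beyond a transition point c, so an outlier's influence on the analysis is bounded rather than growing without limit. A sketch (transition point c is illustrative):

```python
import numpy as np

# Huber-norm observation penalty used in variational QC.
def huber(d, c=2.0):
    """Penalty for normalized innovation d: L2 inside c, L1 outside."""
    d = np.abs(d)
    return np.where(d <= c, 0.5 * d**2, c * d - 0.5 * c**2)

# An innovation of 10 sigma costs 18 under Huber vs 50 under pure L2,
# so the observation is down-weighted instead of hard-rejected.
small = huber(np.array([1.0]))[0]
large = huber(np.array([10.0]))[0]
```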
Fortran: Optimal Interpolation for 2D Temperature Field
Performs OI analysis of a 2D temperature field given a uniform background and scattered noisy observations, demonstrating distance-dependent Gaussian correlation functions and local matrix inversion.
program oi_2d_temperature
implicit none
integer, parameter :: dp = selected_real_kind(15, 307)
integer, parameter :: nx = 41, ny = 41, nobs = 20, mloc = 8
real(dp), parameter :: Lx = 500.0_dp, Ly = 500.0_dp
real(dp), parameter :: Lcorr = 100.0_dp
real(dp), parameter :: sig_b = 2.0_dp, sig_o = 1.0_dp
real(dp) :: xg(nx), yg(ny), T_bg(nx,ny), T_an(nx,ny), T_true(nx,ny)
real(dp) :: xo(nobs), yo(nobs), To(nobs)
real(dp) :: BHt(mloc), HBHtR(mloc,mloc), innov(mloc), wt(mloc)
real(dp) :: dx, dy, dist, xa, ya, incr, rmse_b, rmse_a, u1, u2, z
integer :: i, j, k, m, nl, idx(mloc)
real(dp) :: dists(nobs)
dx = Lx/real(nx-1,dp); dy = Ly/real(ny-1,dp)
do i = 1, nx; xg(i) = (i-1)*dx; end do
do j = 1, ny; yg(j) = (j-1)*dy; end do
! True field: warm + cold anomalies
do j = 1, ny
do i = 1, nx
T_true(i,j) = 280.0_dp &
+ 5.0_dp*exp(-((xg(i)-200.0_dp)**2+(yg(j)-300.0_dp)**2)/(80.0_dp**2)) &
- 3.0_dp*exp(-((xg(i)-350.0_dp)**2+(yg(j)-150.0_dp)**2)/(60.0_dp**2))
end do
end do
T_bg = 280.0_dp ! Uniform background
! Generate observations
call random_seed()
do k = 1, nobs
call random_number(u1); xo(k) = u1*Lx
call random_number(u1); yo(k) = u1*Ly
i = max(1,min(nx,nint(xo(k)/dx)+1))
j = max(1,min(ny,nint(yo(k)/dy)+1))
call random_number(u1); call random_number(u2)
z = sqrt(-2.0_dp*log(u1))*cos(6.2832_dp*u2)
To(k) = T_true(i,j) + sig_o*z
end do
! OI analysis
do j = 1, ny
do i = 1, nx
xa = xg(i); ya = yg(j)
do k = 1, nobs
dists(k) = sqrt((xo(k)-xa)**2 + (yo(k)-ya)**2)
end do
! Select nearest mloc obs
nl = min(nobs, mloc)
call select_near(dists, nobs, nl, idx)
! Build matrices
do k = 1, nl
dist = dists(idx(k))
BHt(k) = sig_b**2 * exp(-0.5_dp*(dist/Lcorr)**2)
m = idx(k)
innov(k) = To(m) - T_bg(max(1,min(nx,nint(xo(m)/dx)+1)), &
max(1,min(ny,nint(yo(m)/dy)+1)))
end do
do k = 1, nl
do m = 1, nl
dist = sqrt((xo(idx(k))-xo(idx(m)))**2 + (yo(idx(k))-yo(idx(m)))**2)
HBHtR(k,m) = sig_b**2*exp(-0.5_dp*(dist/Lcorr)**2)
if (k==m) HBHtR(k,m) = HBHtR(k,m) + sig_o**2
end do
end do
call solve_sys(HBHtR, innov, wt, nl, mloc)
incr = 0.0_dp
do k = 1, nl; incr = incr + BHt(k)*wt(k); end do
T_an(i,j) = T_bg(i,j) + incr
end do
end do
rmse_b = 0.0_dp; rmse_a = 0.0_dp
do j = 1, ny
do i = 1, nx
rmse_b = rmse_b + (T_bg(i,j)-T_true(i,j))**2
rmse_a = rmse_a + (T_an(i,j)-T_true(i,j))**2
end do
end do
rmse_b = sqrt(rmse_b/real(nx*ny,dp))
rmse_a = sqrt(rmse_a/real(nx*ny,dp))
write(*,'(A,F8.4,A)') 'Background RMSE: ', rmse_b, ' K'
write(*,'(A,F8.4,A)') 'Analysis RMSE: ', rmse_a, ' K'
write(*,'(A,F6.1,A)') 'Improvement: ', (1.0_dp-rmse_a/rmse_b)*100.0_dp,' %'
contains
subroutine select_near(d, n, k, idx)
integer, intent(in) :: n, k; real(dp), intent(in) :: d(n)
integer, intent(out) :: idx(k)
real(dp) :: dc(n); integer :: i, j, imin
dc = d
do i = 1, k
imin = 1
do j = 2, n; if (dc(j)<dc(imin)) imin = j; end do
idx(i) = imin; dc(imin) = huge(1.0_dp)
end do
end subroutine
subroutine solve_sys(A, b, x, n, nd)
integer, intent(in) :: n, nd
real(dp), intent(inout) :: A(nd,nd), b(nd)
real(dp), intent(out) :: x(nd)
real(dp) :: fac; integer :: i, j, k
do k = 1, n
if (abs(A(k,k))<1e-12_dp) cycle ! skip near-zero pivot; x is set in back substitution
do i = k+1, n
fac = A(i,k)/A(k,k); A(i,k:n) = A(i,k:n)-fac*A(k,k:n)
b(i) = b(i)-fac*b(k)
end do
end do
do i = n, 1, -1
x(i) = b(i)
do j = i+1, n; x(i) = x(i)-A(i,j)*x(j); end do
if (abs(A(i,i))>1e-12_dp) then; x(i)=x(i)/A(i,i)
else; x(i)=0.0_dp; end if
end do
end subroutine
end program oi_2d_temperature
Interactive Simulation: Kalman Filter Weather Update
Python: Demonstrates a 1D Kalman filter for temperature analysis. Combines a background forecast with sparse noisy observations to produce an optimal analysis, showing how the analysis is a weighted average based on error covariances, with uncertainty bands.
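The interactive code itself is not reproduced here; the update it describes can be sketched as follows, with an illustrative 1D temperature state, Gaussian-correlated background errors, and a handful of point observations:

```python
import numpy as np

# 1D Kalman-filter temperature analysis with uncertainty bands.
n = 50
x = np.linspace(0.0, 500.0, n)                     # km

# Background: smooth forecast with spatially correlated errors
T_bg = 280.0 + 3.0 * np.sin(2 * np.pi * x / 500.0)
sig_b, Lc = 2.0, 80.0                              # error std dev, corr. length
B = sig_b**2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / Lc)**2)

# Sparse noisy observations at a few grid points
obs_idx = np.array([5, 20, 35, 45])
H = np.zeros((obs_idx.size, n))
H[np.arange(obs_idx.size), obs_idx] = 1.0
sig_o = 1.0
R = sig_o**2 * np.eye(obs_idx.size)

rng = np.random.default_rng(5)
T_true = (280.0 + 3.0 * np.sin(2 * np.pi * x / 500.0)
          + 1.5 * np.cos(2 * np.pi * x / 150.0))   # synthetic truth
yo = H @ T_true + sig_o * rng.standard_normal(obs_idx.size)

# Kalman update: analysis mean, covariance, and 1-sigma uncertainty bands
Kg = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
T_an = T_bg + Kg @ (yo - H @ T_bg)
A = (np.eye(n) - Kg @ H) @ B
band_b = sig_b * np.ones(n)                        # background 1-sigma band
band_a = np.sqrt(np.diag(A))                       # analysis 1-sigma band
```

The analysis band pinches toward the observation error near each observation and relaxes back to the background uncertainty in the data-void gaps, which is the behaviour the interactive plot is meant to show.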