7.4 Data Assimilation
Data assimilation is the mathematical discipline of optimally combining imperfect model forecasts (the background) with noisy, incomplete, irregularly distributed observations to produce the best possible estimate of the atmospheric state -- the analysis. This analysis serves as the initial condition for the next forecast. Data assimilation is widely considered the single most important component of the NWP system; the skill of modern forecasts owes as much to assimilation advances as to model improvements. Modern operational systems assimilate tens of millions of observations every 6-12 hours from radiosondes, satellites, aircraft, radar, buoys, and GPS receivers.
Bayesian Framework
Data assimilation is fundamentally a Bayesian estimation problem. Given a prior (the background forecast) and new data (observations), we seek the posterior (the analysis):
$$p(\mathbf{x}|\mathbf{y}) \propto p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})$$
Posterior $\propto$ Likelihood $\times$ Prior. For Gaussian errors, the MAP estimate minimizes a quadratic cost function.
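The scalar Gaussian case makes this concrete. A minimal sketch (illustrative numbers, not from any operational system): the MAP estimate is a precision-weighted average of background and observation, and a brute-force minimization of the quadratic cost recovers the same value.

```python
import numpy as np

# Scalar Gaussian example (illustrative numbers): background with 2 K error,
# observation with 1 K error. Posterior precision = prior + likelihood precision.
xb, sig_b = 280.0, 2.0      # background temperature and its error std dev
yo, sig_o = 283.0, 1.0      # observation and its error std dev

# MAP estimate = minimum of J(x) = (x-xb)^2/(2 sig_b^2) + (yo-x)^2/(2 sig_o^2)
w = sig_b**2 / (sig_b**2 + sig_o**2)                   # weight on the observation
xa = xb + w * (yo - xb)                                # precision-weighted average
sig_a = np.sqrt(1.0 / (1.0/sig_b**2 + 1.0/sig_o**2))   # analysis std dev

# Verify against brute-force minimization of the quadratic cost
xs = np.linspace(275.0, 290.0, 100001)
J = 0.5*(xs - xb)**2/sig_b**2 + 0.5*(yo - xs)**2/sig_o**2
xa_grid = xs[np.argmin(J)]
```

The analysis uncertainty `sig_a` is smaller than either input uncertainty: combining two independent Gaussian estimates always sharpens the posterior.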
Background Error Covariance B
$\mathbf{B} = \overline{(\mathbf{x}^b - \mathbf{x}^t)(\mathbf{x}^b - \mathbf{x}^t)^T}$. Dimension $\sim 10^8 \times 10^8$ -- never stored explicitly. Encodes error variances and spatial correlations, and thereby determines how far an observation's influence spreads. Estimated via the NMC method or from ensembles.
Observation Error Covariance R
$\mathbf{R} = \overline{(\mathbf{y}^o - H(\mathbf{x}^t))(\mathbf{y}^o - H(\mathbf{x}^t))^T}$. Includes instrument, representativeness, and forward model errors. Often assumed diagonal, though satellite inter-channel correlations are increasingly accounted for.
Optimal Interpolation (OI)
OI (Gandin 1963) was the operational standard from the 1970s-1990s. It computes the analysis using observations within a local influence region:
$$\mathbf{x}^a = \mathbf{x}^b + \mathbf{K}(\mathbf{y}^o - H(\mathbf{x}^b))$$
$$\mathbf{K} = \mathbf{BH}^T(\mathbf{HBH}^T + \mathbf{R})^{-1}$$
Analysis = Background + Kalman Gain $\times$ Innovation. The gain K minimizes analysis error variance: $\mathbf{A} = (\mathbf{I} - \mathbf{KH})\mathbf{B}$.
OI is conceptually simple and computationally inexpensive (only local matrix inversions, typically 30-50 nearest observations per analysis point), but B is static and prescribed. Superseded by variational and ensemble methods for operational use.
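A minimal NumPy sketch of the OI/Kalman update above, for a 3-point state with 2 direct observations (all numbers illustrative):

```python
import numpy as np

# Toy OI update: 3 grid points, observations at points 0 and 2.
B = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])              # background error covariance
R = 0.25 * np.eye(2)                         # observation error covariance
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])              # observation operator

xb = np.array([280.0, 281.0, 282.0])         # background state
yo = np.array([281.0, 281.5])                # observations

K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R) # Kalman gain
xa = xb + K @ (yo - H @ xb)                  # analysis
A = (np.eye(3) - K @ H) @ B                  # analysis error covariance
```

Note that the unobserved middle point is corrected too: the off-diagonal elements of B spread the innovations spatially, which is exactly the role the background error covariance plays in operational OI.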
3D-Var: Three-Dimensional Variational Assimilation
3D-Var finds the analysis by minimizing a cost function measuring weighted distance from both background and observations:
$$J(\mathbf{x}) = \frac{1}{2}(\mathbf{x} - \mathbf{x}^b)^T \mathbf{B}^{-1}(\mathbf{x} - \mathbf{x}^b) + \frac{1}{2}(\mathbf{y}^o - H(\mathbf{x}))^T \mathbf{R}^{-1}(\mathbf{y}^o - H(\mathbf{x}))$$
$J_b$ (background penalty) + $J_o$ (observation penalty). Minimum satisfies $\nabla J = 0$.
The gradient: $\nabla J = \mathbf{B}^{-1}(\mathbf{x} - \mathbf{x}^b) - \mathbf{H}^T\mathbf{R}^{-1}(\mathbf{y}^o - H(\mathbf{x}))$ where $\mathbf{H} = \partial H/\partial \mathbf{x}$ is the Jacobian. Minimized iteratively using conjugate gradient or L-BFGS (30-100 iterations). Equivalent to OI when H is linear. B specified via NMC method or ensemble estimates.
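Because H is linear here, the cost function is quadratic and $\nabla J = 0$ reduces to a symmetric positive-definite linear system, which a plain conjugate-gradient loop solves; the result must match the closed-form OI analysis. A sketch with illustrative sizes and values:

```python
import numpy as np

# Toy 3D-Var: minimize the quadratic cost via conjugate gradients and
# compare with the closed-form OI/Kalman solution.
n, p = 6, 3
idx = np.arange(n)
B = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 2.0)   # correlated B
R = 0.1 * np.eye(p)
H = np.zeros((p, n)); H[0, 0] = H[1, 2] = H[2, 5] = 1.0  # observe 3 points

xb = np.linspace(280.0, 285.0, n)                        # background
yo = H @ xb + np.array([0.5, -0.3, 0.2])                 # synthetic obs

Binv, Rinv = np.linalg.inv(B), np.linalg.inv(R)

# grad J = 0  <=>  (B^-1 + H^T R^-1 H) dx = H^T R^-1 (yo - H xb)
A = Binv + H.T @ Rinv @ H
b = H.T @ Rinv @ (yo - H @ xb)

def cg(A, b, tol=1e-12, maxit=100):
    """Plain conjugate gradients for SPD A (the usual 3D-Var inner loop)."""
    x = np.zeros_like(b); r = b.copy(); pdir = r.copy(); rs = r @ r
    for _ in range(maxit):
        Ap = A @ pdir
        alpha = rs / (pdir @ Ap)
        x += alpha * pdir
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        pdir = r + (rs_new / rs) * pdir
        rs = rs_new
    return x

xa = xb + cg(A, b)

# Closed-form OI solution for comparison
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
xa_oi = xb + K @ (yo - H @ xb)
```

Operational systems never form $\mathbf{B}^{-1}$ explicitly; they minimize in a preconditioned control-variable space, but the quadratic structure is the same.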
4D-Var: Four-Dimensional Variational Assimilation
4D-Var extends 3D-Var over a time window, using the forecast model as a strong constraint:
$$J(\mathbf{x}_0) = \frac{1}{2}(\mathbf{x}_0 - \mathbf{x}^b)^T \mathbf{B}^{-1}(\mathbf{x}_0 - \mathbf{x}^b) + \frac{1}{2}\sum_{i=0}^{N}(\mathbf{y}_i - H_i(\mathbf{x}_i))^T \mathbf{R}_i^{-1}(\mathbf{y}_i - H_i(\mathbf{x}_i))$$
subject to $\mathbf{x}_{i+1} = M_{i \to i+1}(\mathbf{x}_i)$. Requires the adjoint model $\mathbf{M}^T$ integrated backward for gradient computation.
Tangent Linear Model
$\delta\mathbf{x}_i = \mathbf{M}_{0 \to i}\,\delta\mathbf{x}_0$. Propagates perturbations forward. Valid for small perturbations over limited times (6-12 hours). Requires linearizing all model processes, dynamics and physics alike.
Adjoint Model
Transpose $\mathbf{M}^T$ integrated backward. Propagates gradient information from observations back to initial time. Major software engineering effort. ECMWF 4D-Var uses 12-hour windows with incremental approach at reduced resolution (T159-T399).
4D-Var implicitly evolves B in time, providing flow-dependent error statistics. The incremental approach (Courtier et al. 1994): minimize for increment $\delta\mathbf{x} = \mathbf{x}^a - \mathbf{x}^b$ at lower resolution, then add to full-resolution background.
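A toy strong-constraint example, assuming a *linear* model M (a random matrix near the identity) so that the adjoint is simply $\mathbf{M}^T$; window length, observations, and error statistics are illustrative. The adjoint-computed gradient is checked against central finite differences, the standard sanity test for any adjoint code:

```python
import numpy as np

# Toy strong-constraint 4D-Var with a linear model (illustrative).
n, N = 4, 3                                   # state dim, window steps
rng = np.random.default_rng(1)
M = np.eye(n) + 0.05 * rng.standard_normal((n, n))   # one-step model
H = np.eye(n)                                 # observe the full state
Binv = np.eye(n)                              # background error: unit variance
Rinv = 25.0 * np.eye(n)                       # obs error std dev 0.2

xb = rng.standard_normal(n)                   # background at window start
ys = [rng.standard_normal(n) for _ in range(N + 1)]  # synthetic observations

def cost(x0):
    J = 0.5 * (x0 - xb) @ Binv @ (x0 - xb)
    x = x0.copy()
    for i in range(N + 1):
        d = ys[i] - H @ x
        J += 0.5 * d @ Rinv @ d
        x = M @ x                             # advance to the next obs time
    return J

def grad_adjoint(x0):
    # Forward sweep: store the trajectory
    xs = [x0]
    for _ in range(N):
        xs.append(M @ xs[-1])
    ds = [ys[i] - H @ xs[i] for i in range(N + 1)]
    # Backward sweep: propagate observation information with M^T
    lam = -H.T @ Rinv @ ds[N]
    for i in range(N - 1, -1, -1):
        lam = M.T @ lam - H.T @ Rinv @ ds[i]
    return Binv @ (x0 - xb) + lam

x0 = xb + 0.1
g = grad_adjoint(x0)
eps = 1e-6
g_fd = np.array([(cost(x0 + eps * e) - cost(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
```

For a nonlinear NWP model the backward sweep uses the adjoint of the tangent linear model linearized about the stored trajectory, which is what makes the adjoint such a large software effort.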
Ensemble Kalman Filter (EnKF)
The EnKF (Evensen 1994) estimates B from an ensemble of forecasts, making it flow-dependent without requiring an adjoint model:
$$\mathbf{P}^b \approx \frac{1}{K-1}\sum_{k=1}^{K}(\mathbf{x}_k^b - \bar{\mathbf{x}}^b)(\mathbf{x}_k^b - \bar{\mathbf{x}}^b)^T$$
$$\mathbf{x}_k^a = \mathbf{x}_k^b + \mathbf{K}(\mathbf{y}_k^o - H(\mathbf{x}_k^b)), \quad \mathbf{K} = \mathbf{P}^b\mathbf{H}^T(\mathbf{HP}^b\mathbf{H}^T + \mathbf{R})^{-1}$$
Sample covariance from K members (typically 20-80). Perturbed observations: $\mathbf{y}_k^o = \mathbf{y}^o + \boldsymbol{\epsilon}_k$, $\boldsymbol{\epsilon}_k \sim N(0,\mathbf{R})$
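A stochastic (perturbed-observation) EnKF analysis step can be sketched as follows; state size, ensemble size, and error statistics are illustrative:

```python
import numpy as np

# Stochastic EnKF update for a small toy state.
rng = np.random.default_rng(2)
n, p, K = 10, 4, 40                      # state dim, obs dim, ensemble size

H = np.zeros((p, n))
H[np.arange(p), [0, 3, 6, 9]] = 1.0      # observe 4 of the 10 points
R = 0.25 * np.eye(p)

xt = np.sin(np.linspace(0, np.pi, n))    # synthetic "truth"
Xb = xt[:, None] + rng.standard_normal((n, K))   # background ensemble
yo = H @ xt + 0.5 * rng.standard_normal(p)       # noisy observations

xbar = Xb.mean(axis=1, keepdims=True)
Xp = Xb - xbar                                   # ensemble perturbations
Pb = Xp @ Xp.T / (K - 1)                         # sample covariance

Kg = Pb @ H.T @ np.linalg.inv(H @ Pb @ H.T + R)  # Kalman gain
# Perturbed observations: each member assimilates yo + eps_k, eps_k ~ N(0, R)
Yo = yo[:, None] + 0.5 * rng.standard_normal((p, K))
Xa = Xb + Kg @ (Yo - H @ Xb)

spread_b = Xb.std(axis=1, ddof=1).mean()
spread_a = Xa.std(axis=1, ddof=1).mean()
```

Perturbing the observations is what keeps the analysis ensemble spread statistically consistent with the Kalman-filter analysis covariance; without it the spread would be systematically too small.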
Localization (Gaspari-Cohn)
With K $\ll$ state dimension, sample covariance has spurious long-range correlations. Gaspari-Cohn (1999) function tapers covariances to zero beyond cutoff (~500-2000 km horizontal, ~1 scale height vertical). Applied via Schur (element-wise) product: $\mathbf{P}_{loc}^b = \boldsymbol{\rho} \circ \mathbf{P}^b$.
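The Gaspari-Cohn taper is a fifth-order piecewise-rational function with compact support at twice the half-width c. A sketch of the function and its Schur-product application on a 1D grid (grid and ensemble sizes illustrative):

```python
import numpy as np

# Gaspari-Cohn (1999) compactly supported correlation function.
def gaspari_cohn(d, c):
    """Taper value for separation d and localization half-width c."""
    r = np.abs(d) / c
    rho = np.zeros_like(r)                   # zero beyond r = 2
    m1 = r <= 1.0
    m2 = (r > 1.0) & (r <= 2.0)
    r1, r2 = r[m1], r[m2]
    rho[m1] = -0.25*r1**5 + 0.5*r1**4 + 0.625*r1**3 - 5.0/3.0*r1**2 + 1.0
    rho[m2] = (r2**5/12.0 - 0.5*r2**4 + 0.625*r2**3 + 5.0/3.0*r2**2
               - 5.0*r2 + 4.0 - 2.0/(3.0*r2))
    return rho

# Schur (element-wise) product with a small-ensemble sample covariance
x = np.arange(20, dtype=float)               # grid coordinates (arbitrary units)
dist = np.abs(x[:, None] - x[None, :])
rho = gaspari_cohn(dist, c=5.0)              # support out to 10 grid units

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 15))            # 15 members: spurious correlations
Pb = np.cov(X)
Pb_loc = rho * Pb                            # localized covariance
```

After localization, covariances between widely separated points are exactly zero, so a distant observation can no longer corrupt the analysis through sampling noise.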
Inflation
Spread collapse corrected via multiplicative inflation: $\mathbf{x}_k^b \leftarrow \bar{\mathbf{x}}^b + \alpha(\mathbf{x}_k^b - \bar{\mathbf{x}}^b)$ with $\alpha \approx 1.01$-1.10, or additive inflation (random perturbations from climatological B). Adaptive inflation (Anderson 2007) estimates $\alpha$ from innovation statistics.
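Multiplicative inflation is a one-line transformation of the ensemble: it rescales perturbations about the mean without moving the mean. A sketch with illustrative sizes:

```python
import numpy as np

# Multiplicative inflation applied to a background ensemble.
rng = np.random.default_rng(6)
Xb = 280.0 + rng.standard_normal((10, 30))   # state dim 10, 30 members
alpha = 1.05                                 # inflation factor (typical range)

xbar = Xb.mean(axis=1, keepdims=True)
Xb_infl = xbar + alpha * (Xb - xbar)         # mean unchanged, spread x alpha

s0 = Xb.std(axis=1, ddof=1)
s1 = Xb_infl.std(axis=1, ddof=1)
```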
Hybrid Ensemble-Variational Methods
$$\mathbf{B}_{hybrid} = \beta_1\mathbf{B}_{static} + \beta_2\mathbf{B}_{ensemble}$$
$\beta_1 + \beta_2 = 1$, typically $\beta_2 \approx 0.5$-0.8. Static B provides baseline; ensemble B provides flow-dependent features.
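The blend itself is a convex combination of two covariance matrices; a minimal sketch (sizes, weights, and covariances illustrative) also shows why the static term matters: it keeps the hybrid positive definite even where the low-rank ensemble estimate is deficient.

```python
import numpy as np

# Hybrid covariance: static climatological B plus an ensemble estimate.
n = 8
idx = np.arange(n)
B_static = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 3.0)  # climatological

rng = np.random.default_rng(4)
X = rng.standard_normal((n, 20))                 # 20-member toy ensemble
Xp = X - X.mean(axis=1, keepdims=True)
B_ens = Xp @ Xp.T / (20 - 1)                     # flow-dependent sample covariance

beta2 = 0.7                                      # typical weight on the ensemble part
beta1 = 1.0 - beta2
B_hybrid = beta1 * B_static + beta2 * B_ens
```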
Hybrid 3DEnVar (NCEP GFS)
3D-Var with B blended from static climatological covariance and GDAS EnKF ensemble (80 members). Extended control variables incorporate ensemble covariance in the variational minimization.
Hybrid 4DEnVar (ECMWF)
Ensemble provides temporal evolution of covariances across the window without requiring an adjoint. ECMWF uses 50-member EDA (Ensemble of Data Assimilations) to supplement static B in 4D-Var. State of the art.
Observation Types and Quality Control
Radiosondes
~900 stations globally, 2x daily. T, RH, wind profiles to ~30 hPa. Gold standard for calibration. Sparse over oceans and Southern Hemisphere. Observation operator: spatial interpolation.
Satellite Radiances
$H(\mathbf{x}) = \int B(\nu,T(z))\frac{\partial\tau}{\partial z}dz + \epsilon B(\nu,T_s)\tau_s$. AMSU, IASI, CrIS, ATMS. ~60% of total forecast impact. Requires radiative transfer model (RTTOV, CRTM) and bias correction (VarBC).
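A discretized sketch of this radiance observation operator: a weighting-function ($\partial\tau/\partial z$) weighted integral of the Planck function over a model temperature profile, plus a surface term. The profile, channel frequency, and transmittance curve below are toy assumptions, not output from RTTOV or CRTM.

```python
import numpy as np

# Toy clear-sky radiance forward operator on a discrete height grid.
h, c, kB = 6.626e-34, 2.998e8, 1.381e-23     # Planck, light speed, Boltzmann

def planck(nu, T):
    """Planck radiance at frequency nu [Hz] and temperature T [K]."""
    return 2*h*nu**3 / c**2 / np.expm1(h*nu / (kB*T))

nu = 6.0e13                                  # ~5 um infrared channel (assumed)
z = np.linspace(0.0, 20e3, 41)               # height grid [m]
T = 288.0 - 6.5e-3 * np.minimum(z, 11e3)     # simple lapse-rate profile
tau = np.exp(-2.0 * np.exp(-z / 8e3))        # toy level-to-space transmittance

eps_s, Ts = 0.98, 290.0                      # surface emissivity, skin temp
dtau = np.diff(tau)                          # layer weighting function
Tmid = 0.5 * (T[:-1] + T[1:])
radiance = eps_s * planck(nu, Ts) * tau[0] + np.sum(planck(nu, Tmid) * dtau)
```

The layers where `dtau` peaks are the levels the channel "sees", which is why each satellite channel constrains temperature over a broad vertical layer rather than at a point.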
GPS Radio Occultation
Refractivity: $N = 77.6\frac{p}{T} + 3.73 \times 10^5\frac{e}{T^2}$. No calibration needed, all-weather. COSMIC-2 provides 5000+ profiles/day. Disproportionately large per-observation impact.
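Plugging illustrative warm, moist boundary-layer values into the refractivity formula above:

```python
# Smith-Weintraub refractivity (illustrative values).
p = 1000.0      # hPa, total pressure
T = 290.0       # K
e = 15.0        # hPa, water vapour partial pressure

N_dry = 77.6 * p / T            # hydrostatic (dry) term
N_wet = 3.73e5 * e / T**2       # water vapour term
N = N_dry + N_wet               # ~334 N-units
```

The strong wet term is why radio occultation carries significant humidity information in the lower troposphere, while in the dry upper atmosphere it constrains temperature.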
Aircraft (AMDAR/ACARS)
~300,000 reports/day. Temperature and wind at cruise altitude, profiles during ascent/descent. High temporal resolution along flight routes. 15-20% of total forecast impact in Northern Hemisphere.
Quality Control
Background and Buddy Checks
Background check: reject if $|y^o - H(x^b)| > n\sqrt{\sigma_b^2 + \sigma_o^2}$ (n=3-5). Buddy check: compare each observation against nearby observations; reject spatial outliers. Innovation should be drawn from $N(0, \mathbf{HBH}^T + \mathbf{R})$.
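The background check is a one-line vectorized test; the numbers below are illustrative, with one deliberately gross observation error:

```python
import numpy as np

# Background ("first-guess") check: reject observations whose innovation
# exceeds n standard deviations of the expected innovation spread.
sig_b, sig_o, nsig = 1.5, 1.0, 4.0
thresh = nsig * np.sqrt(sig_b**2 + sig_o**2)

Hxb = np.array([280.0, 281.0, 279.5, 282.0])   # background at obs locations
yo  = np.array([280.8, 281.5, 291.0, 281.2])   # third value is a gross error

innov = yo - Hxb
keep = np.abs(innov) < thresh                   # reject the 11.5 K innovation
```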
Variational QC and Bias Correction
VarQC: Huber norm combines $L_2$ for small innovations with $L_1$ for outliers, providing robustness without hard rejection. VarBC (Dee 2005): estimates satellite bias predictors (scan position, airmass, skin temperature) within the minimization itself.
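The Huber penalty is quadratic for small normalized innovations and linear beyond a transition point c, so an outlier's influence on the analysis is bounded rather than growing without limit. A sketch (transition point c is illustrative):

```python
import numpy as np

# Huber-norm observation penalty used in variational QC.
def huber(d, c=2.0):
    """Penalty for normalized innovation d: L2 inside c, L1 outside."""
    d = np.abs(d)
    return np.where(d <= c, 0.5 * d**2, c * d - 0.5 * c**2)

# An innovation of 10 sigma costs 18 under Huber vs 50 under pure L2,
# so the observation is down-weighted instead of hard-rejected.
small = huber(np.array([1.0]))[0]
large = huber(np.array([10.0]))[0]
```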
Fortran: Optimal Interpolation for 2D Temperature Field
Performs OI analysis of a 2D temperature field given a uniform background and scattered noisy observations, demonstrating distance-dependent Gaussian correlation functions and local matrix inversion.
program oi_2d_temperature
implicit none
integer, parameter :: dp = selected_real_kind(15, 307)
integer, parameter :: nx = 41, ny = 41, nobs = 20, mloc = 8
real(dp), parameter :: Lx = 500.0_dp, Ly = 500.0_dp
real(dp), parameter :: Lcorr = 100.0_dp
real(dp), parameter :: sig_b = 2.0_dp, sig_o = 1.0_dp
real(dp) :: xg(nx), yg(ny), T_bg(nx,ny), T_an(nx,ny), T_true(nx,ny)
real(dp) :: xo(nobs), yo(nobs), To(nobs)
real(dp) :: BHt(mloc), HBHtR(mloc,mloc), innov(mloc), wt(mloc)
real(dp) :: dx, dy, dist, xa, ya, incr, rmse_b, rmse_a, u1, u2, z
integer :: i, j, k, m, nl, idx(mloc)
real(dp) :: dists(nobs)
dx = Lx/real(nx-1,dp); dy = Ly/real(ny-1,dp)
do i = 1, nx; xg(i) = (i-1)*dx; end do
do j = 1, ny; yg(j) = (j-1)*dy; end do
! True field: warm + cold anomalies
do j = 1, ny
do i = 1, nx
T_true(i,j) = 280.0_dp &
+ 5.0_dp*exp(-((xg(i)-200.0_dp)**2+(yg(j)-300.0_dp)**2)/(80.0_dp**2)) &
- 3.0_dp*exp(-((xg(i)-350.0_dp)**2+(yg(j)-150.0_dp)**2)/(60.0_dp**2))
end do
end do
T_bg = 280.0_dp ! Uniform background
! Generate observations
call random_seed()
do k = 1, nobs
call random_number(u1); xo(k) = u1*Lx
call random_number(u1); yo(k) = u1*Ly
i = max(1,min(nx,nint(xo(k)/dx)+1))
j = max(1,min(ny,nint(yo(k)/dy)+1))
call random_number(u1); call random_number(u2)
z = sqrt(-2.0_dp*log(u1))*cos(6.2832_dp*u2)
To(k) = T_true(i,j) + sig_o*z
end do
! OI analysis
do j = 1, ny
do i = 1, nx
xa = xg(i); ya = yg(j)
do k = 1, nobs
dists(k) = sqrt((xo(k)-xa)**2 + (yo(k)-ya)**2)
end do
! Select nearest mloc obs
nl = min(nobs, mloc)
call select_near(dists, nobs, nl, idx)
! Build matrices
do k = 1, nl
dist = dists(idx(k))
BHt(k) = sig_b**2 * exp(-0.5_dp*(dist/Lcorr)**2)
m = idx(k)
innov(k) = To(m) - T_bg(max(1,min(nx,nint(xo(m)/dx)+1)), &
max(1,min(ny,nint(yo(m)/dy)+1)))
end do
do k = 1, nl
do m = 1, nl
dist = sqrt((xo(idx(k))-xo(idx(m)))**2 + (yo(idx(k))-yo(idx(m)))**2)
HBHtR(k,m) = sig_b**2*exp(-0.5_dp*(dist/Lcorr)**2)
if (k==m) HBHtR(k,m) = HBHtR(k,m) + sig_o**2
end do
end do
call solve_sys(HBHtR, innov, wt, nl, mloc)
incr = 0.0_dp
do k = 1, nl; incr = incr + BHt(k)*wt(k); end do
T_an(i,j) = T_bg(i,j) + incr
end do
end do
rmse_b = 0.0_dp; rmse_a = 0.0_dp
do j = 1, ny
do i = 1, nx
rmse_b = rmse_b + (T_bg(i,j)-T_true(i,j))**2
rmse_a = rmse_a + (T_an(i,j)-T_true(i,j))**2
end do
end do
rmse_b = sqrt(rmse_b/real(nx*ny,dp))
rmse_a = sqrt(rmse_a/real(nx*ny,dp))
write(*,'(A,F8.4,A)') 'Background RMSE: ', rmse_b, ' K'
write(*,'(A,F8.4,A)') 'Analysis RMSE: ', rmse_a, ' K'
write(*,'(A,F6.1,A)') 'Improvement: ', (1.0_dp-rmse_a/rmse_b)*100.0_dp,' %'
contains
subroutine select_near(d, n, k, idx)
integer, intent(in) :: n, k; real(dp), intent(in) :: d(n)
integer, intent(out) :: idx(k)
real(dp) :: dc(n); integer :: i, j, imin
dc = d
do i = 1, k
imin = 1
do j = 2, n; if (dc(j)<dc(imin)) imin = j; end do
idx(i) = imin; dc(imin) = huge(1.0_dp)
end do
end subroutine
subroutine solve_sys(A, b, x, n, nd)
integer, intent(in) :: n, nd
real(dp), intent(inout) :: A(nd,nd), b(nd)
real(dp), intent(out) :: x(nd)
real(dp) :: fac; integer :: i, j, k
do k = 1, n
if (abs(A(k,k))<1e-12_dp) cycle ! skip near-zero pivot; x is set in back substitution
do i = k+1, n
fac = A(i,k)/A(k,k); A(i,k:n) = A(i,k:n)-fac*A(k,k:n)
b(i) = b(i)-fac*b(k)
end do
end do
do i = n, 1, -1
x(i) = b(i)
do j = i+1, n; x(i) = x(i)-A(i,j)*x(j); end do
if (abs(A(i,i))>1e-12_dp) then; x(i)=x(i)/A(i,i)
else; x(i)=0.0_dp; end if
end do
end subroutine
end program oi_2d_temperature
Interactive Simulation: Kalman Filter Weather Update
Python: Demonstrates a 1D Kalman filter for temperature analysis. Combines a background forecast with sparse noisy observations to produce an optimal analysis, showing how the analysis is a weighted average based on error covariances, with uncertainty bands.
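The interactive code itself is not reproduced here; the update it describes can be sketched as follows, with an illustrative 1D temperature state, Gaussian-correlated background errors, and a handful of point observations:

```python
import numpy as np

# 1D Kalman-filter temperature analysis with uncertainty bands.
n = 50
x = np.linspace(0.0, 500.0, n)                     # km

# Background: smooth forecast with spatially correlated errors
T_bg = 280.0 + 3.0 * np.sin(2 * np.pi * x / 500.0)
sig_b, Lc = 2.0, 80.0                              # error std dev, corr. length
B = sig_b**2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / Lc)**2)

# Sparse noisy observations at a few grid points
obs_idx = np.array([5, 20, 35, 45])
H = np.zeros((obs_idx.size, n))
H[np.arange(obs_idx.size), obs_idx] = 1.0
sig_o = 1.0
R = sig_o**2 * np.eye(obs_idx.size)

rng = np.random.default_rng(5)
T_true = (280.0 + 3.0 * np.sin(2 * np.pi * x / 500.0)
          + 1.5 * np.cos(2 * np.pi * x / 150.0))   # synthetic truth
yo = H @ T_true + sig_o * rng.standard_normal(obs_idx.size)

# Kalman update: analysis mean, covariance, and 1-sigma uncertainty bands
Kg = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
T_an = T_bg + Kg @ (yo - H @ T_bg)
A = (np.eye(n) - Kg @ H) @ B
band_b = sig_b * np.ones(n)                        # background 1-sigma band
band_a = np.sqrt(np.diag(A))                       # analysis 1-sigma band
```

The analysis band pinches toward the observation error near each observation and relaxes back to the background uncertainty in the data-void gaps, which is the behaviour the interactive plot is meant to show.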