# 2021-10-18 Iterative methods

## Last time

- Sparse direct solvers
  - matrix orderings
  - impact on formulation
  - cost scaling
- Discussion with Ioana Fleming

## Today

- Why iterative solvers
- Stationary iterative methods
- Preconditioning

```
using Plots
using LinearAlgebra
using SparseArrays
default(linewidth=4)
function advdiff_matrix(n; kappa=1, wind=[0, 0])
    "Advection-diffusion with Dirichlet boundary conditions eliminated"
    h = 2 / (n + 1)
    rows = Vector{Int64}()
    cols = Vector{Int64}()
    vals = Vector{Float64}()
    idx((i, j),) = (i-1)*n + j
    in_domain((i, j),) = 1 <= i <= n && 1 <= j <= n
    stencil_advect = [-wind[1], -wind[2], 0, wind[1], wind[2]] / h
    stencil_diffuse = [-1, -1, 4, -1, -1] * kappa / h^2
    stencil = stencil_advect + stencil_diffuse
    for i in 1:n
        for j in 1:n
            neighbors = [(i-1, j), (i, j-1), (i, j), (i+1, j), (i, j+1)]
            mask = in_domain.(neighbors)
            append!(rows, idx.(repeat([(i,j)], 5))[mask])
            append!(cols, idx.(neighbors)[mask])
            append!(vals, stencil[mask])
        end
    end
    sparse(rows, cols, vals)
end

function my_spy(A)
    cmax = norm(vec(A), Inf)
    s = max(1, ceil(120 / size(A, 1)))
    spy(A, marker=(:square, s), c=:diverging_rainbow_bgymr_45_85_c67_n256, clims=(-cmax, cmax))
end
```

```
my_spy (generic function with 1 method)
```
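As a quick usage check (added here; not a cell from the original notebook), we can assemble a small advection-diffusion matrix and look at its sparsity pattern:

```
A = advdiff_matrix(6, kappa=1, wind=[1, 0])  # small 36×36 advection-diffusion operator
my_spy(A)                                    # color-coded sparsity pattern
```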

# Why iterative solvers over direct solvers?

Some caveats first:

- Less reliable, more leaky abstraction
- More sensitive to problem formulation
- Slower for small problems
- Several different strategies, each with tuning knobs
- Accuracy tolerances needed

## \(O(N)\) solvers available for many important problems

High-order discretization *can* be okay

# Gradient descent

Suppose \(A\) is a symmetric positive definite matrix and consider the scalar functional

\[ f(u) = \frac 1 2 u^T A u - b^T u . \]

Then the gradient is

\[ \nabla_u f = A u - b . \]

```
x = LinRange(-4, 4, 40)
A = [1 0; 0 3]
b = [1, 1]
f(u) = .5 * u' * A * u - b' * u
contour(x, x, (u1, u2) -> f([u1, u2]), aspect_ratio=:equal)
```

# Aside: Derivative of a dot product

Let \(f(\boldsymbol x) = \boldsymbol y^T \boldsymbol x = \sum_i y_i x_i\) and compute the derivative

\[ \frac{\partial f}{\partial \boldsymbol x} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \dotsb \end{bmatrix} = \begin{bmatrix} y_1 & y_2 & \dotsb \end{bmatrix} = \boldsymbol y^T . \]

Note that \(\boldsymbol y^T \boldsymbol x = \boldsymbol x^T \boldsymbol y\) and we have the product rule,

\[ \frac{\partial (\boldsymbol x^T A \boldsymbol x)}{\partial \boldsymbol x} = \boldsymbol x^T A + (A \boldsymbol x)^T = \boldsymbol x^T (A + A^T) . \]

Also, when \(A\) is symmetric this is \(2 \boldsymbol x^T A\), which is how the gradient \(\nabla_u f = A u - b\) above follows from \(f(u) = \frac 1 2 u^T A u - b^T u\).
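A quick numerical sanity check of these formulas (my addition, not a cell from the original notes), comparing the analytic gradient \(A u - b\) with finite-difference quotients using the `f`, `A`, and `b` defined above:

```
g(u) = A * u - b                      # analytic gradient derived above (A is symmetric)
u = [.7, -.3]; delta = 1e-6
fd = [(f(u + delta*e) - f(u)) / delta for e in ([1, 0], [0, 1])]
fd, g(u)                              # the two should agree to about 1e-6
```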

# Aside: Variational notation

It’s convenient to express derivatives in terms of how they act on an infinitesimal perturbation. So we might write

\[ \delta f = \frac{\partial f}{\partial \boldsymbol x} \delta \boldsymbol x . \]

(It’s common to use \(\delta x\) or \(dx\) for these infinitesimals.) This makes inner products look like a normal product rule,

\[ \delta (\boldsymbol x^T \boldsymbol y) = (\delta \boldsymbol x)^T \boldsymbol y + \boldsymbol x^T (\delta \boldsymbol y) . \]

A powerful example of variational notation is differentiating a matrix inverse

\[ 0 = \delta I = \delta (A^{-1} A) = (\delta A^{-1}) A + A^{-1} (\delta A) \]

and thus

\[ \delta A^{-1} = -A^{-1} (\delta A) A^{-1} . \]
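A quick check of this identity (again my addition): for a small perturbation, the finite difference of the inverse should match the variational formula up to second-order terms.

```
M = [2. 1; 1 3]
dM = 1e-8 * [1. 0; 0 -1]                 # a small perturbation δM
finite_diff = inv(M + dM) - inv(M)       # δ(M⁻¹) computed directly
variational = -inv(M) * dM * inv(M)      # the variational formula
norm(finite_diff - variational)          # tiny: O(‖δM‖²)
```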

# Try gradient descent

```
function grad_descent(loss, grad, u0; omega=1e-3, tol=1e-5)
    """Minimize loss(u) via gradient descent with initial guess u0
    and learning rate omega. Declares convergence when the gradient
    norm is less than tol or after 500 steps.
    """
    u = copy(u0)
    uhist = [copy(u)]
    lhist = [loss(u)]
    for it in 1:500
        g = grad(u)
        u -= omega * g
        push!(uhist, copy(u))
        push!(lhist, loss(u))
        if norm(g) < tol
            break
        end
    end
    (u, hcat(uhist...), lhist)
end
```

```
grad_descent (generic function with 1 method)
```

# Visualize gradient descent

```
A = [1 1; 1 4]
loss(u) = .5 * u' * A * u
grad(u) = A * u
u, uhist, lhist = grad_descent(loss, grad, [.9, .9], omega=.48)
plot(lhist, yscale=:log10)
```

```
plot(uhist[1, :], uhist[2, :], marker=:circle)
x = LinRange(-1, 1, 30)
contour!(x, x, (x,y) -> loss([x, y]))
```

# Richardson iteration

The simplest iterative method is Richardson’s method, which solves \(A u = b\) by the iteration

\[ u_{k+1} = u_k + \omega (b - A u_k) \]

where \(\omega > 0\) is a damping parameter.

- Algebraically equivalent to gradient descent when \(A\) is SPD
- Non-symmetric matrices are harder to visualize
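A minimal implementation sketch (my addition, not from the original notebook), with a default \(\omega\) chosen small enough for the \(n = 10\) matrix used here so that \(\rho(I - \omega A) < 1\):

```
function richardson(A, b; omega=.008, maxit=500)
    u = zero.(b)                     # initial guess u_0 = 0
    for k in 1:maxit
        u += omega * (b - A * u)     # u_{k+1} = u_k + ω (b - A u_k)
    end
    u
end

A = advdiff_matrix(10)
b = ones(100)
norm(A * richardson(A, b) - b)       # residual after 500 iterations
```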

# Richardson convergence

If \(b = A u_*\), this iteration is equivalent to

\[ u_{k+1} = u_k + \omega A (u_* - u_k) . \]

It is convenient for convergence analysis to identify the “error” \(e_k = u_k - u_*\), in which this becomes

\[ e_{k+1} = (I - \omega A) e_k, \qquad e_k = (I - \omega A)^k e_0 . \]

The norm and spectrum of the *iteration matrix* \(I - \omega A\) tell the whole story.

Suppose that the eigendecomposition

\[ X \Lambda X^{-1} = I - \omega A \]

exists. Then \((I - \omega A)^k = X \Lambda^k X^{-1}\), so the convergence (or divergence) rate is governed by the spectral radius of the iteration matrix. Two caveats:

- Not all matrices are diagonalizable.
- The matrix \(X\) may be very ill-conditioned.
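We can see this numerically (a check added here, not in the original notes): the average per-step error contraction of Richardson approaches the spectral radius of \(I - \omega A\).

```
A = advdiff_matrix(10)
omega = .008
rho = maximum(abs.(eigvals(Matrix(I - omega * A))))   # spectral radius of the iteration matrix
u_star = randn(100); b = A * u_star                   # manufactured solution
u = zero.(b)
for k in 1:50
    u .+= omega * (b - A * u)                         # in-place Richardson update
end
(norm(u - u_star) / norm(u_star))^(1/50), rho         # observed contraction per step ≈ ρ
```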

# Aside: Schur decomposition

We can repair these weaknesses by using the Schur decomposition

\[ A = Q R Q^* , \]

where \(R\) is right-triangular and \(Q\) is unitary. Where are the eigenvalues in \(R\)?
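The answer, checked numerically (my addition): in the complex Schur form, the eigenvalues sit on the diagonal of the triangular factor.

```
B = randn(4, 4)
S = schur(complex(B))        # complex Schur form: B = Z T Z' with T upper triangular
bykey = x -> (real(x), imag(x))
sort(diag(S.T), by=bykey) ≈ sort(eigvals(B), by=bykey)   # true: eigenvalues on diag(T)
```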

Evidently we must find \(\omega\) to minimize the maximum modulus among the eigenvalues of \(I - \omega A\). We can do this if \(A\) is well conditioned, but not in general.

# Ill-conditioning

## Question: What is the condition number of the Laplacian on 100 evenly spaced points?

How does it scale under grid refinement?

```
A = advdiff_matrix(10)
cond(Matrix(A))
```

```
48.37415007870825
```
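Refining the grid (a small experiment added here), the condition number grows by roughly \(4\times\) each time \(n\) doubles, consistent with \(\kappa(A) = O(h^{-2})\):

```
for n in [10, 20, 40]
    A_n = advdiff_matrix(n)
    println("n = $n: cond = ", cond(Matrix(A_n)))
end
```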

```
omega = .01
ev = eigvals(Matrix(I - omega * A))
scatter(real.(ev), imag.(ev))
```

# Monic polynomials small on the spectrum

Equivalently to finding \(\omega\) such that \(\lVert I - \omega A \rVert\) is minimized, we may seek a polynomial \(p(z) = 1 - \omega z\) (normalized so that \(p(0) = 1\)) that minimizes

\[ \max_{\lambda \in \sigma(A)} \lvert p(\lambda) \rvert . \]

This concept can be extended to higher-degree polynomials, which is essentially what Krylov methods do (discovering the polynomial adaptively, weighted by the right-hand side).
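For SPD \(A\) the optimal single \(\omega\) balances the two ends of the spectrum, giving \(\omega = 2/(\lambda_{\min} + \lambda_{\max})\) and contraction factor \((\kappa - 1)/(\kappa + 1)\). A quick check of this standard fact (my addition):

```
ev = eigvals(Matrix(A))
lmin, lmax = extrema(ev)
omega_best = 2 / (lmin + lmax)
maximum(abs.(1 .- omega_best .* ev)), (lmax/lmin - 1) / (lmax/lmin + 1)  # both ≈ 0.96 here
```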

```
ev = eigvals(Matrix(A))
scatter(real.(ev), zero.(ev))
plot!(x -> 1 - omega * x)
```

# Preconditioning

Preconditioning is the act of creating an “affordable” operation “\(P^{-1}\)” such that \(P^{-1} A\) (or \(A P^{-1}\)) is well-conditioned or otherwise has a “nice” spectrum. We then solve the system

\[ P^{-1} A u = P^{-1} b \]

in which case the convergence rate depends on the spectrum of the iteration matrix

\[ I - \omega P^{-1} A . \]

- The preconditioner must be applied on each iteration.
- It is *not* merely about finding a good initial guess.
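As a concrete sketch (my addition, not from the original notebook), here is preconditioned Richardson with the simplest choice, Jacobi: \(P = \operatorname{diag}(A)\). For this constant-coefficient problem \(\operatorname{diag}(A)\) is constant, so Jacobi amounts to a rescaling of \(\omega\); for variable coefficients it does real work.

```
function richardson_pc(A, b; omega=1., maxit=500)
    Pinv = 1 ./ diag(A)                       # Jacobi preconditioner: P = diag(A)
    u = zero.(b)
    for k in 1:maxit
        u += omega * (Pinv .* (b - A * u))    # u_{k+1} = u_k + ω P⁻¹ (b - A u_k)
    end
    u
end

A = advdiff_matrix(10); b = ones(100)
norm(A * richardson_pc(A, b) - b)             # converges with ω = 1 (no hand-tuned ω)
```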

There are two complementary techniques necessary for efficient iterative methods:

- “accelerators” or Krylov methods, which use orthogonality to adaptively converge faster than Richardson
- preconditioners that improve the spectrum of the preconditioned operator

Although there is ongoing research in Krylov methods and they are immensely useful, I would say preconditioning is 90% of the game, both for practical applications and as a research area.