# 2021-10-18 Iterative methods

## Last time

- Sparse direct solvers
  - matrix orderings
  - impact on formulation
  - cost scaling
- Discussion with Ioana Fleming

## Today

- Why iterative solvers
- Stationary iterative methods
- Preconditioning

```
using Plots
using LinearAlgebra
using SparseArrays
default(linewidth=4)
function advdiff_matrix(n; kappa=1, wind=[0, 0])
    "Advection-diffusion with Dirichlet boundary conditions eliminated"
    h = 2 / (n + 1)
    rows = Vector{Int64}()
    cols = Vector{Int64}()
    vals = Vector{Float64}()
    idx((i, j),) = (i-1)*n + j
    in_domain((i, j),) = 1 <= i <= n && 1 <= j <= n
    stencil_advect = [-wind[1], -wind[2], 0, wind[1], wind[2]] / h
    stencil_diffuse = [-1, -1, 4, -1, -1] * kappa / h^2
    stencil = stencil_advect + stencil_diffuse
    for i in 1:n
        for j in 1:n
            neighbors = [(i-1, j), (i, j-1), (i, j), (i+1, j), (i, j+1)]
            mask = in_domain.(neighbors)
            append!(rows, idx.(repeat([(i,j)], 5))[mask])
            append!(cols, idx.(neighbors)[mask])
            append!(vals, stencil[mask])
        end
    end
    sparse(rows, cols, vals)
end

function my_spy(A)
    cmax = norm(vec(A), Inf)
    s = max(1, ceil(120 / size(A, 1)))
    spy(A, marker=(:square, s), c=:diverging_rainbow_bgymr_45_85_c67_n256, clims=(-cmax, cmax))
end
```

```
my_spy (generic function with 1 method)
```
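As a quick usage check (added here; not a cell from the original notebook), we can assemble a small advection-diffusion matrix and look at its sparsity pattern:

```
A = advdiff_matrix(6, kappa=1, wind=[1, 0])  # small 36×36 advection-diffusion operator
my_spy(A)                                    # color-coded sparsity pattern
```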

# Why iterative solvers over direct solvers?

Some caveats first:

- Less reliable, more leaky abstraction
- More sensitive to problem formulation
- Slower for small problems
- Several different strategies, each with tuning knobs
- Accuracy tolerances needed

## \(O(N)\) solvers available for many important problems

High-order discretization *can* be okay

# Gradient descent

Suppose \(A\) is a symmetric positive definite matrix and consider the scalar functional

\[ f(u) = \frac 1 2 u^T A u - b^T u . \]

Then the gradient is

\[ \nabla_u f = A u - b . \]

```
x = LinRange(-4, 4, 40)
A = [1 0; 0 3]
b = [1, 1]
f(u) = .5 * u' * A * u - b' * u
contour(x, x, (u1, u2) -> f([u1, u2]), aspect_ratio=:equal)
```

# Aside: Derivative of a dot product

Let \(f(\boldsymbol x) = \boldsymbol y^T \boldsymbol x = \sum_i y_i x_i\) and compute the derivative

\[ \frac{\partial f}{\partial \boldsymbol x} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \dotsb \end{bmatrix} = \begin{bmatrix} y_1 & y_2 & \dotsb \end{bmatrix} = \boldsymbol y^T . \]

Note that \(\boldsymbol y^T \boldsymbol x = \boldsymbol x^T \boldsymbol y\) and we have the product rule,

\[ \frac{\partial (\boldsymbol x^T A \boldsymbol x)}{\partial \boldsymbol x} = \boldsymbol x^T A + (A \boldsymbol x)^T = \boldsymbol x^T (A + A^T) . \]

Also, when \(A\) is symmetric this is \(2 \boldsymbol x^T A\), which is how the gradient \(\nabla_u f = A u - b\) above follows from \(f(u) = \frac 1 2 u^T A u - b^T u\).
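A quick numerical sanity check of these formulas (my addition, not a cell from the original notes), comparing the analytic gradient \(A u - b\) with finite-difference quotients using the `f`, `A`, and `b` defined above:

```
g(u) = A * u - b                      # analytic gradient derived above (A is symmetric)
u = [.7, -.3]; delta = 1e-6
fd = [(f(u + delta*e) - f(u)) / delta for e in ([1, 0], [0, 1])]
fd, g(u)                              # the two should agree to about 1e-6
```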

# Aside: Variational notation

It’s convenient to express derivatives in terms of how they act on an infinitesimal perturbation. So we might write

\[ \delta f = \frac{\partial f}{\partial \boldsymbol x} \delta \boldsymbol x . \]

(It’s common to use \(\delta x\) or \(dx\) for these infinitesimals.) This makes inner products look like a normal product rule,

\[ \delta (\boldsymbol x^T \boldsymbol y) = (\delta \boldsymbol x)^T \boldsymbol y + \boldsymbol x^T (\delta \boldsymbol y) . \]

A powerful example of variational notation is differentiating a matrix inverse

\[ 0 = \delta I = \delta (A^{-1} A) = (\delta A^{-1}) A + A^{-1} (\delta A) \]

and thus

\[ \delta A^{-1} = -A^{-1} (\delta A) A^{-1} . \]
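A quick check of this identity (again my addition): for a small perturbation, the finite difference of the inverse should match the variational formula up to second-order terms.

```
M = [2. 1; 1 3]
dM = 1e-8 * [1. 0; 0 -1]                 # a small perturbation δM
finite_diff = inv(M + dM) - inv(M)       # δ(M⁻¹) computed directly
variational = -inv(M) * dM * inv(M)      # the variational formula
norm(finite_diff - variational)          # tiny: O(‖δM‖²)
```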

# Try gradient descent

```
function grad_descent(loss, grad, u0; omega=1e-3, tol=1e-5)
    """Minimize loss(u) via gradient descent with initial guess u0
    and learning rate omega. Declares convergence when the gradient
    norm is less than tol or after 500 steps.
    """
    u = copy(u0)
    uhist = [copy(u)]
    lhist = [loss(u)]
    for it in 1:500
        g = grad(u)
        u -= omega * g
        push!(uhist, copy(u))
        push!(lhist, loss(u))
        if norm(g) < tol
            break
        end
    end
    (u, hcat(uhist...), lhist)
end
```

```
grad_descent (generic function with 1 method)
```

# Visualize gradient descent

```
A = [1 1; 1 4]
loss(u) = .5 * u' * A * u
grad(u) = A * u
u, uhist, lhist = grad_descent(loss, grad, [.9, .9], omega=.48)
plot(lhist, yscale=:log10)
```

```
plot(uhist[1, :], uhist[2, :], marker=:circle)
x = LinRange(-1, 1, 30)
contour!(x, x, (x,y) -> loss([x, y]))
```

# Richardson iteration

The simplest iterative method is Richardson’s method, which solves \(A u = b\) by the iteration

\[ u_{k+1} = u_k + \omega (b - A u_k) \]

where \(\omega > 0\) is a damping parameter.

- Algebraically equivalent to gradient descent when \(A\) is SPD
- Non-symmetric matrices are harder to visualize
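A minimal implementation sketch (my addition, not from the original notebook), with a default \(\omega\) chosen small enough for the \(n = 10\) matrix used here so that \(\rho(I - \omega A) < 1\):

```
function richardson(A, b; omega=.008, maxit=500)
    u = zero.(b)                     # initial guess u_0 = 0
    for k in 1:maxit
        u += omega * (b - A * u)     # u_{k+1} = u_k + ω (b - A u_k)
    end
    u
end

A = advdiff_matrix(10)
b = ones(100)
norm(A * richardson(A, b) - b)       # residual after 500 iterations
```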

# Richardson convergence

If \(b = A u_*\), this iteration is equivalent to

\[ u_{k+1} = u_k + \omega A (u_* - u_k) . \]

It is convenient for convergence analysis to identify the “error” \(e_k = u_k - u_*\), in which this becomes

\[ e_{k+1} = (I - \omega A) e_k, \qquad e_k = (I - \omega A)^k e_0 . \]

The norm and spectrum of the *iteration matrix* \(I - \omega A\) tell the whole story.

Suppose that the eigendecomposition

\[ X \Lambda X^{-1} = I - \omega A \]

exists. Then \((I - \omega A)^k = X \Lambda^k X^{-1}\), so the convergence (or divergence) rate is governed by the spectral radius of the iteration matrix. Two caveats:

- Not all matrices are diagonalizable.
- The matrix \(X\) may be very ill-conditioned.
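We can see this numerically (a check added here, not in the original notes): the average per-step error contraction of Richardson approaches the spectral radius of \(I - \omega A\).

```
A = advdiff_matrix(10)
omega = .008
rho = maximum(abs.(eigvals(Matrix(I - omega * A))))   # spectral radius of the iteration matrix
u_star = randn(100); b = A * u_star                   # manufactured solution
u = zero.(b)
for k in 1:50
    u .+= omega * (b - A * u)                         # in-place Richardson update
end
(norm(u - u_star) / norm(u_star))^(1/50), rho         # observed contraction per step ≈ ρ
```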

# Aside: Schur decomposition

We can repair these weaknesses by using the Schur decomposition

\[ A = Q R Q^* , \]

where \(R\) is right-triangular and \(Q\) is unitary. Where are the eigenvalues in \(R\)?
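The answer, checked numerically (my addition): in the complex Schur form, the eigenvalues sit on the diagonal of the triangular factor.

```
B = randn(4, 4)
S = schur(complex(B))        # complex Schur form: B = Z T Z' with T upper triangular
bykey = x -> (real(x), imag(x))
sort(diag(S.T), by=bykey) ≈ sort(eigvals(B), by=bykey)   # true: eigenvalues on diag(T)
```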

Evidently we must find \(\omega\) to minimize the maximum modulus among the eigenvalues of \(I - \omega A\). We can do this if \(A\) is well conditioned, but not in general.

# Ill-conditioning

## Question: What is the condition number of the Laplacian on 100 evenly spaced points?

How does it scale under grid refinement?

```
A = advdiff_matrix(10)
cond(Matrix(A))
```

```
48.37415007870825
```
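Refining the grid (a small experiment added here), the condition number grows by roughly \(4\times\) each time \(n\) doubles, consistent with \(\kappa(A) = O(h^{-2})\):

```
for n in [10, 20, 40]
    A_n = advdiff_matrix(n)
    println("n = $n: cond = ", cond(Matrix(A_n)))
end
```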

```
omega = .01
ev = eigvals(Matrix(I - omega * A))
scatter(real.(ev), imag.(ev))
```

# Monic polynomials small on the spectrum

Equivalently to finding \(\omega\) such that \(\lVert I - \omega A \rVert\) is minimized, we may seek a polynomial \(p(z) = 1 - \omega z\) (normalized so that \(p(0) = 1\)) that minimizes

\[ \max_{\lambda \in \sigma(A)} \lvert p(\lambda) \rvert . \]

This concept can be extended to higher-degree polynomials, which is essentially what Krylov methods do (discovering the polynomial adaptively, weighted by the right-hand side).
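For SPD \(A\) the optimal single \(\omega\) balances the two ends of the spectrum, giving \(\omega = 2/(\lambda_{\min} + \lambda_{\max})\) and contraction factor \((\kappa - 1)/(\kappa + 1)\). A quick check of this standard fact (my addition):

```
ev = eigvals(Matrix(A))
lmin, lmax = extrema(ev)
omega_best = 2 / (lmin + lmax)
maximum(abs.(1 .- omega_best .* ev)), (lmax/lmin - 1) / (lmax/lmin + 1)  # both ≈ 0.96 here
```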

```
ev = eigvals(Matrix(A))
scatter(real.(ev), zero.(ev))
plot!(x -> 1 - omega * x)
```

# Preconditioning

Preconditioning is the act of creating an “affordable” operation “\(P^{-1}\)” such that \(P^{-1} A\) (or \(A P^{-1}\)) is well-conditioned or otherwise has a “nice” spectrum. We then solve the system

\[ P^{-1} A u = P^{-1} b \]

in which case the convergence rate depends on the spectrum of the iteration matrix

\[ I - \omega P^{-1} A . \]

- The preconditioner must be applied on each iteration.
- It is *not* merely about finding a good initial guess.
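As a concrete sketch (my addition, not from the original notebook), here is preconditioned Richardson with the simplest choice, Jacobi: \(P = \operatorname{diag}(A)\). For this constant-coefficient problem \(\operatorname{diag}(A)\) is constant, so Jacobi amounts to a rescaling of \(\omega\); for variable coefficients it does real work.

```
function richardson_pc(A, b; omega=1., maxit=500)
    Pinv = 1 ./ diag(A)                       # Jacobi preconditioner: P = diag(A)
    u = zero.(b)
    for k in 1:maxit
        u += omega * (Pinv .* (b - A * u))    # u_{k+1} = u_k + ω P⁻¹ (b - A u_k)
    end
    u
end

A = advdiff_matrix(10); b = ones(100)
norm(A * richardson_pc(A, b) - b)             # converges with ω = 1 (no hand-tuned ω)
```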

There are two complementary techniques necessary for efficient iterative methods:

- “accelerators” or Krylov methods, which use orthogonality to adaptively converge faster than Richardson
- preconditioners that improve the spectrum of the preconditioned operator

Although there is ongoing research in Krylov methods and they are immensely useful, I would say preconditioning is 90% of the game, both for practical applications and as a research area.