Direct Methods and Matrix Factorizations (Julia)

Machine Learning Fundamentals for Economists

Jesse Perla

University of British Columbia

Overview

Motivation

  • In preparation for the ML lectures we cover some core numerical linear algebra concepts
  • Many of these are directly useful
    • e.g. solving large systems of equations or as building blocks in bigger algorithms

Packages and Materials

using LinearAlgebra, Statistics, BenchmarkTools, SparseArrays, Random
using Plots
Random.seed!(42);  # seed random numbers for reproducibility
Precompiling packages...
    534.4 msStatistics
  1 dependency successfully precompiled in 1 seconds
Precompiling packages...
   1697.6 msStructUtils
   5338.0 msJSON
   1287.8 msBenchmarkTools
  3 dependencies successfully precompiled in 8 seconds. 14 already precompiled.
Precompiling packages...
   1124.8 msQuartoNotebookWorkerJSONExt (serial)
  1 dependency successfully precompiled in 1 seconds
Precompiling packages...
    816.6 msLaTeXStrings
   2832.1 msFormat
   2044.5 msRequires
   3649.4 msOrderedCollections
   1351.8 msStatistics → SparseArraysExt
   1319.6 msJpegTurbo_jll
   2166.0 msGhostscript_jll
   5460.2 msMacroTools
   4314.2 msLatexify
    730.3 msLatexify → SparseArraysExt
  10 dependencies successfully precompiled in 13 seconds. 17 already precompiled.
Precompiling packages...
    724.5 msStatsAPI
    870.9 msMeasures
   1023.6 msContour
   1203.3 msDocStringExtensions
   1221.0 msURIs
    932.1 msSimpleBufferStream
    796.6 msPtrArrays
   4132.7 msGrisu
    881.2 msDataAPI
    735.1 msReexport
   1188.8 msTranscodingStreams
   3934.2 msNaNMath
   5717.5 msIrrationalConstants
    667.3 msBitFlags
    977.4 msTensorCore
    923.9 msStableRNGs
   2046.4 msConcurrentUtilities
   1052.4 msDelimitedFiles
   3867.2 msUnzip
   1016.6 msLoggingExtras
   1292.1 msExceptionUnwrapping
   8106.7 msFixedPointNumbers
   4187.2 msDataStructures
   1339.8 msXorg_libICE_jll
   2716.4 msUnicodeFun
   1531.2 msLibffi_jll
   1276.2 msLibuuid_jll
   1426.0 msLLVMOpenMP_jll
   1369.3 msfzf_jll
   1359.9 msOgg_jll
   1390.2 msmtdev_jll
   1333.3 msx265_jll
   1403.5 msx264_jll
   1198.8 msGraphite2_jll
   1685.9 msFriBidi_jll
   1145.2 msEpollShim_jll
   1251.9 msXorg_libXau_jll
   1466.9 mslibpng_jll
   1365.6 msLAME_jll
   1387.5 msXorg_libXdmcp_jll
   1561.8 mslibaom_jll
   1576.6 msMbedTLS_jll
   1655.3 msZstd_jll
   1527.3 msExpat_jll
   1173.3 msLZO_jll
   1406.1 msOpus_jll
   1312.6 msXorg_xtrans_jll
   1425.7 mseudev_jll
   1336.6 msLibmount_jll
   1284.0 mslibfdk_aac_jll
   1464.8 msBzip2_jll
   1322.4 msLERC_jll
   1461.0 mslibevdev_jll
   1633.9 msXZ_jll
    966.5 msAliasTables
   1743.9 msGettextRuntime_jll
    786.8 msShowoff
   1134.5 msCodecZlib
   3994.2 msRecipesBase
   1472.0 msLogExpFunctions
   3396.7 msMissings
   1201.0 msSortingAlgorithms
   1263.9 msXorg_libSM_jll
   4620.2 msOpenSSL
   1536.5 msPixman_jll
   1362.2 msJLFzf
   5131.9 msColorTypes
   1520.9 mslibvorbis_jll
   1463.8 msDbus_jll
   2493.8 msMbedTLS
   4264.3 msXorg_libxcb_jll
   1556.1 msFreeType2_jll
   1881.5 msWayland_jll
   1391.5 msLibtiff_jll
   1516.5 mslibinput_jll
    864.5 msColorTypes → StyledStringsExt
   2182.4 msGlib_jll
   6838.2 msColorVectorSpace
   8448.2 msStatsBase
   1158.6 msXorg_xcb_util_jll
   1333.2 msXorg_libX11_jll
   1973.5 msFontconfig_jll
   1401.5 msXorg_xcb_util_keysyms_jll
  13317.7 msColors
   1238.8 msXorg_xcb_util_image_jll
   1617.7 msXorg_xcb_util_renderutil_jll
   1569.0 msXorg_xcb_util_wm_jll
   1467.9 msXorg_libXrender_jll
   1341.5 msXorg_libXext_jll
   1504.9 msXorg_libXfixes_jll
   1392.8 msXorg_libxkbfile_jll
   1369.0 msXorg_xcb_util_cursor_jll
   1283.9 msXorg_libXinerama_jll
   1600.0 msXorg_libXrandr_jll
   1803.9 msLibglvnd_jll
   1764.0 msCairo_jll
   9046.6 msColorSchemes
   1326.7 msXorg_libXcursor_jll
   1418.7 msXorg_libXi_jll
   1252.0 msXorg_xkbcomp_jll
   2120.2 msHarfBuzz_jll
   1601.1 msXorg_xkeyboard_config_jll
   1565.6 mslibass_jll
   1898.8 msPango_jll
   1449.1 msxkbcommon_jll
   2878.9 msFFMPEG_jll
   1497.0 msVulkan_Loader_jll
   1563.3 mslibdecor_jll
   1212.1 msFFMPEG
   3831.3 msQt6Base_jll
   1741.6 msGLFW_jll
   1668.5 msQt6ShaderTools_jll
   2341.2 msGR_jll
  29804.3 msPlotUtils
   6487.4 msQt6Declarative_jll
   6982.6 msPlotThemes
  59833.2 msHTTP
   1350.8 msQt6Wayland_jll
  10193.2 msRecipesPipeline
   6219.5 msGR
  81886.5 msPlots
  121 dependencies successfully precompiled in 203 seconds. 56 already precompiled.
Precompiling packages...
   2491.7 msQuartoNotebookWorkerPlotsExt (serial)
  1 dependency successfully precompiled in 3 seconds

Complexity

Basic Computational Complexity

Big-O Notation

For a function \(f(N)\) and a positive constant \(C\), we say \(f(N)\) is \(\mathrm{O}(g(N))\), if there exist positive constants \(C\) and \(N_0\) such that:

\[ 0 \leq f(N) \leq C \cdot g(N) \quad \text{for all } N \geq N_0 \]

  • Often crucial to know how problems scale asymptotically (as \(N\to\infty\))
  • Caution! This is only an asymptotic limit, and can be misleading for small \(N\)
    • \(f_1(N) = N^3 + N\) is \(\mathrm{O}(N^3)\)
    • \(f_2(N) = 1000 N^2 + 3 N\) is \(\mathrm{O}(N^2)\)
    • For roughly \(N>1000\) use \(f_2\) algorithm, otherwise \(f_1\)

Examples of Computational Complexity

  • Simple examples:
    • \(x \cdot y = \sum_{n=1}^N x_n y_n\) is \(\mathrm{O}(N)\) since it requires \(N\) multiplications and additions
    • \(A x\) for \(A\in\mathbb{R}^{N\times N},x\in\mathbb{R}^N\) is \(\mathrm{O}(N^2)\) since it requires \(N\) dot products, each \(\mathrm{O}(N)\)

Computational Complexity

Ask yourself whether the following is a computationally expensive operation as the matrix size increases

  • Multiplying two matrices?
    • Answer: It depends. Multiplying two diagonal matrices is trivial.
  • Solving a linear system of equations?
    • Answer: It depends. If the matrix is the identity, the solution is the vector itself.
  • Finding the eigenvalues of a matrix?
    • Answer: It depends. The eigenvalues of a triangular matrix are the diagonal elements.

Numerical Precision

Machine Epsilon

For a given datatype, \(\epsilon\) is defined as \(\epsilon = \min_{\delta > 0} \left\{ \delta : 1 + \delta > 1 \right\}\)

  • Computers have finite precision. 64-bit typical, but 32-bit on GPUs
machine epsilon for float64 = 2.220446049250313e-16
1 + eps/2 == 1? true
machine epsilon for float32 = 1.1920929e-7

Matrix Structure

Matrix Structure

  • A key principle is to ensure you don’t lose “structure”
    • e.g. if sparse, operations should keep it sparse if possible
    • If triangular, then use appropriate algorithms instead of converting back to a dense matrix
  • Key structure is:
    • Symmetry, diagonal, tridiagonal, banded, sparse, positive-definite
  • The worse operations for losing structure are matrix multiplication and inversion

Example Losing Sparsity

  • Here the density increases substantially
A = sprand(10, 10, 0.45)  # random sparse 10x10, 45 percent filled with non-zeros

@show nnz(A)  # counts the number of non-zeros
invA = sparse(inv(Array(A)))  # Julia won't invert sparse, so convert to dense with Array.
@show nnz(invA);
nnz(A) = 46
nnz(invA) = 100

Losing Tridiagonal Structure

  • An even more extreme example. Tridiagonal has roughly \(3N\) nonzeros. Inverses are dense \(N^2\)
N = 5
A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
inv(A)
5×5 Matrix{Float64}:
  1.29099      -0.327957     0.0416667  -0.00537634   0.000672043
 -0.163978      1.31183     -0.166667    0.0215054   -0.00268817
  0.0208333    -0.166667     1.29167    -0.166667     0.0208333
 -0.00268817    0.0215054   -0.166667    1.31183     -0.163978
  0.000672043  -0.00537634   0.0416667  -0.327957     1.29099

Forming the Covariance and/or Gram Matrix

  • Another common example is \(A' A\)
A = sprand(20, 21, 0.3)
@show nnz(A) / 20^2
@show nnz(A' * A) / 21^2;
nnz(A) / 20 ^ 2 = 0.34
nnz(A' * A) / 21 ^ 2 = 0.9229024943310657

Specialized Algorithms

Besides sparsity/storage, the real loss is you miss out on algorithms. For example, lets setup the benchmarking code

using BenchmarkTools
function benchmark_solve(A, b)
    println("A\\b for typeof(A) = $(string(typeof(A)))")
    @btime $A \ $b
end
benchmark_solve (generic function with 1 method)

Compare Dense vs. Sparse vs. Tridiagonal

N = 1000
b = rand(N)
A = Tridiagonal([fill(0.1, N - 2); 0.2], fill(0.8, N), [0.2; fill(0.1, N - 2)])
A_sparse = sparse(A)  # sparse but losing tridiagonal structure
A_dense = Array(A)    # dropping the sparsity structure, dense 1000x1000

# benchmark solution to system A x = b
benchmark_solve(A, b)
benchmark_solve(A_sparse, b)
benchmark_solve(A_dense, b);
A\b for typeof(A) = LinearAlgebra.Tridiagonal{Float64, Vector{Float64}}
  24.456 μs (20 allocations: 47.39 KiB)
A\b for typeof(A) = SparseArrays.SparseMatrixCSC{Float64, Int64}
  521.106 μs (95 allocations: 1.03 MiB)
A\b for typeof(A) = Matrix{Float64}
  18.059 ms (9 allocations: 7.64 MiB)

Triangular Matrices and Back/Forward Substitution

  • A key example of a better algorithm is for triangular matrices
  • Upper or lower triangular matrices can be solved in \(\mathrm{O}(N^2)\) instead of \(\mathrm{O}(N^3)\)
b = [1.0, 2.0, 3.0]
U = UpperTriangular([1.0 2.0 3.0;
                     0.0 5.0 6.0;
                     0.0 0.0 9.0])
U \ b
3-element Vector{Float64}:
 0.0
 0.0
 0.3333333333333333

Backwards Substitution Example

\[ \begin{aligned} U x &= b\\ U &\equiv \begin{bmatrix} 3 & 1 \\ 0 & 2 \\ \end{bmatrix}, \quad b = \begin{bmatrix} 7 \\ 2 \\ \end{bmatrix} \end{aligned} \]

Solving bottom row for \(x_2\)

\[ 2 x_2 = 2,\quad x_2 = 1 \]

Move up a row, solving for \(x_1\), substituting for \(x_2\)

\[ 3 x_1 + 1 x_2 = 7,\quad 3 x_1 + 1 \times 1 = 7,\quad x_1 = 2 \]

Generalizes to many rows. For \(L\) it is “forward substitution”

Factorizations

Factorizing matrices

  • Just like you can factor \(6 = 2 \cdot 3\), you can factor matrices
  • Unlike integers, you have more choice over the properties of the factors
  • Many operations (e.g., solving systems of equations, finding eigenvalues, inverting, finding determinants) have a factorization done internally
    • Instead you can often just find the factorization and reuse it
  • Key factorizations: LU, QR, Cholesky, SVD, Schur, Eigenvalue

LU(P) Decompositions

  • We can “factor” any square \(A\) into \(P A = L U\) for triangular \(L\) and \(U\). Invertible can have \(A = L U\), called the LU decomposition. “P” is for partial-pivoting
  • Singular matrices may not have full-rank \(L\) or \(U\) matrices
N = 4
A = rand(N, N)
b = rand(N)
# chooses the right factorization based on matrix structure
# LU here
Af = factorize(A)
Af.P * A  Af.L * Af.U
true

Using a Factorization

  • In Julia the factorization objects typically overload the \ and functions such as inv
@show Af \ b
@show inv(Af) * b
Af \ b = [1.567631093835083, -1.8670423474177864, -0.7020922312927874, 1.0653095651070625]
inv(Af) * b = [1.5676310938350828, -1.8670423474177873, -0.7020922312927873, 1.0653095651070625]
4-element Vector{Float64}:
  1.5676310938350828
 -1.8670423474177873
 -0.7020922312927873
  1.0653095651070625

LU Decompositions and Systems of Equations

  • Pivoting is typically implied when talking about “LU”
  • Used in the default solve algorithm (without more structure)
  • Solving systems of equations with triangular matrices: for \(A x = L U x = b\)
    1. Define \(y = U x\)
    2. Solve \(L y = b\) for \(y\) and \(U x = y\) for \(x\)
  • Since both are triangular, process is \(\mathrm{O}(N^2)\) (but LU itself \(\mathrm{O}(N^3)\))
  • Could be used to find inv
    • \(A = L U\) then \(A A^{-1} = I = L U A^{-1} = I\)
    • Solve for \(Y\) in \(L Y = I\), then solve \(U A^{-1} = Y\)
  • Tight connection to textbook Gaussian elimination (including pivoting)

Cholesky

  • LU is for general invertible matrices, but it doesn’t use positive-definiteness or symmetry
  • The Cholesky is the right factorization for general positive-definite matrices. For general symmetric matrices you can use Bunch-Kaufman or others
  • \(A = L L'\) for lower triangular \(L\) or equivalent for upper triangular
N = 500
B = rand(N, N)
A_dense = B' * B  # an easy way to generate a symmetric positive semi-definite matrix
A = Symmetric(A_dense)  # flags the matrix as symmetric
println("A is symmetric? $(issymmetric(A))")
A is symmetric? true

Comparing Cholesky

  • By default it doesn’t know the matrix is positive-definite, so factorize is the best it can do given symmetric structure
b = rand(N)
factorize(A) |> typeof
cholesky(A) \ b  # use the factorization to solve

benchmark_solve(A, b)
benchmark_solve(A_dense, b)
@btime cholesky($A, check = false) \ $b;
A\b for typeof(A) = LinearAlgebra.Symmetric{Float64, Matrix{Float64}}
  1.970 ms (13 allocations: 2.16 MiB)
A\b for typeof(A) = Matrix{Float64}
  2.747 ms (9 allocations: 1.92 MiB)
  1.500 ms (6 allocations: 1.91 MiB)

Eigen Decomposition

  • For square, symmetric, non-singular matrix \(A\) factor into

\[ A = Q \Lambda Q^{-1} \]

  • \(Q\) is a matrix of eigenvectors, \(\Lambda\) is a diagonal matrix of paired eigenvalues
  • For symmetric matrices, the eigenvectors are orthogonal and \(Q^{-1} Q = Q^T Q = I\) which form an orthonormal basis
  • Orthogonal matrices can be thought of as rotations without stretching
  • More general matrices all have a Singular Value Decomposition (SVD)
  • With symmetric \(A\), an interpretation of \(A x\) is that we can first rotate \(x\) into the \(Q\) basis, then stretch by \(\Lambda\), then rotate back

Calculating the Eigen Decomposition

A = Symmetric(rand(5, 5))  # symmetric matrices have real eigenvectors/eigenvalues
A_eig = eigen(A)
Λ = Diagonal(A_eig.values)
Q = A_eig.vectors
@show norm(Q * Λ * inv(Q) - A)
@show norm(Q * Λ * Q' - A);
norm(Q * Λ * inv(Q) - A) = 5.243912693681636e-15
norm(Q * Λ * Q' - A) = 6.347597591257379e-15

Eigendecompositions and Matrix Powers

  • Can be used to find \(A^t\) for large \(t\) (e.g. for Markov chains)
    • \(P^t\), i.e. \(P \cdot P \cdot \ldots \cdot P\) for \(t\) times
    • \(P = Q \Lambda Q^{-1}\) then \(P^t = Q \Lambda^t Q^{-1}\) where \(\Lambda^t\) is just the pointwise power
  • Related can find matrix exponential \(e^A\) for square matrices
    • \(e^A = Q e^\Lambda Q^{-1}\) where \(e^\Lambda\) is just the pointwise exponential
    • Useful for solving differential equations, e.g. \(y' = A y\) for \(y(0) = y_0\) is \(y(t) = e^{A t} y_0\)

More on Factorizations

  • Plenty more used in different circumstances. Start by looking at structure
  • Usually have some connection to textbook algorithms, for example LU is Gaussian elimination with pivoting and QR is Gram-Schmidt Process
  • Just as shortcuts can be done with sparse matrices in textbook examples, direct sparse methods can be faster given enough sparsity
    • But don’t assume sparsity will be faster. Often slower unless matrices are big and especially sparse
    • Dense algorithms on GPUs can be very fast because of parallelism
  • Keep in mind that barring numerical roundoff issues, these are “exact” methods. They don’t become more accurate with more iterations

Large Scale Systems of Equations

  • Packages that solve BIG problems with “direct methods” include MUMPS, Paradiso, UMFPACK, and many others
  • Sparse solvers are bread-and-butter scientific computing, so they can crush huge problems, parallelize on a cluster, etc.
  • But for smaller problems they may not be ideal. Profile and test, and only if you need it.
  • In Julia, the SciML package LinearSolver.jl is your best bet there, as it lets you swap out backends to profile
  • On Python harder to flip between them, but scipy has many built in and many wrappers exist. Same with Matlab

Preview of Conditioning

  • It will turn out that for iterative methods, a different style of algorithm, it is often necessary to multiple by a matrix to transform the problem
  • The ideal transform would be the matrix’s inverse, which requires a full factorization.
  • But instead, you can do only part of the way towards the factorization. e.g., part of the way on gaussian elimination
  • Called “Incomplete Cholesky”, “Incomplete LU”, etc.

Continuous Time Markov Chains

Markov Chains Transitions in in Continuous Time

  • For a discrete number of states, we cannot have instantaneous transitions between states or it ceases to be measurable
  • Instead: intensity of switching from state \(i\) to \(j\) as a \(q_{ij}\) where

\[ \mathbb P \{ X(t + \Delta) = j \,|\, X(t) \} = \begin{cases} q_{ij} \Delta + o(\Delta) & i \neq j\\ 1 + q_{ii} \Delta + o(\Delta) & i = j \end{cases} \]

  • With \(o(\Delta)\) is little-o notation. That is, \(\lim_{\Delta\to 0} o(\Delta)/\Delta = 0\).

Intensity Matrix

  • \(Q_{ij} = q_{ij}\) for \(i \neq j\) and \(Q_{ii} = -\sum_{j \neq i} q_{ij}\)
  • Rows sum to 0
  • For example, consider a counting process

\[ Q = \begin{bmatrix} -0.1 & 0.1 & 0 & 0 & 0 & 0\\ 0.1 &-0.2 & 0.1 & 0 & 0 & 0\\ 0 & 0.1 & -0.2 & 0.1 & 0 & 0\\ 0 & 0 & 0.1 & -0.2 & 0.1 & 0\\ 0 & 0 & 0 & 0.1 & -0.2 & 0.1\\ 0 & 0 & 0 & 0 & 0.1 & -0.1\\ \end{bmatrix} \]

Probability Dynamics

  • The \(Q\) is the infinitesimal generator of the stochastic process.

  • Let \(\pi(t) \in \mathbb{R}^N\) with \(\pi_i(t) \equiv \mathbb{P}[X_t = i\,|\,X_0]\)

  • Then the probability distribution evolution (Fokker-Planck or KFE), is \[ \frac{d}{dt} \pi(t) = \pi(t) Q,\quad \text{ given }\pi(0) \]

  • Or, often written as \(\frac{d}{dt} \pi(t) = Q^{\top} \cdot \pi(t)\), i.e. in terms of the “adjoint” of the linear operator \(Q\)

  • A steady state is then a solution to \(Q^{\top} \cdot \bar{\pi} = 0\)

    • i.e., the \(\bar{\pi}\) left-eigenvector associated with eigenvalue 0 (i.e. \(\bar{\pi} Q = 0\times \bar{\pi}\))

Setting up a Counting Process

alpha = 0.1
N = 6
Q = Tridiagonal(fill(alpha, N - 1),
                [-alpha; fill(-2*alpha, N - 2); -alpha],
                fill(alpha, N - 1))
Q
6×6 Tridiagonal{Float64, Vector{Float64}}:
 -0.1   0.1    ⋅     ⋅     ⋅     ⋅ 
  0.1  -0.2   0.1    ⋅     ⋅     ⋅ 
   ⋅    0.1  -0.2   0.1    ⋅     ⋅ 
   ⋅     ⋅    0.1  -0.2   0.1    ⋅ 
   ⋅     ⋅     ⋅    0.1  -0.2   0.1
   ⋅     ⋅     ⋅     ⋅    0.1  -0.1

Finding the Stationary Distribution

  • There will always be at least one eigenvalue of 0, and the corresponding eigenvector is the stationary distribution
  • Teaser: Do we really need all of the eigenvectors/eigenvalues?
Lambda, vecs = eigen(Array(Q'))
@show Lambda
vecs[:, N] ./ sum(vecs[:, N])
Lambda = [-0.3732050807568874, -0.29999999999999993, -0.19999999999999998, -0.09999999999999995, -0.026794919243112274, 0.0]
6-element Vector{Float64}:
 0.16666666666666657
 0.16666666666666657
 0.1666666666666667
 0.16666666666666682
 0.16666666666666685
 0.16666666666666663

Using the Generator in a Bellman Equation

  • Let \(r \in \mathbb{R}^N\) be a vector of payoffs in each state, and \(\rho > 0\) a discount rate,
  • Then we can use the \(Q\) generator as a simple Bellman Equation (using the Kolmogorov Backwards Equation) to find the value \(v\) in each state,

\[ \rho v = r + Q v \]

  • Rearranging, \((\rho I - Q) v = r\)
  • Teaser: can we just implement \((\rho I - Q)\cdot v\) and avoid factorizing the matrix?

Implementing the Bellman Equation

rho = 0.05
r = range(0.0, 10.0, length=N)
@show typeof(rho*I - Q)

# solve (rho * I - Q) v = r
v = (rho * I - Q) \ r
typeof(rho * I - Q) = LinearAlgebra.Tridiagonal{Float64, Vector{Float64}}
6-element Vector{Float64}:
  38.15384615384615
  57.23076923076923
  84.92307692307693
 115.07692307692311
 142.76923076923077
 161.84615384615384