INVGN - Gauss-Newton inversion, version 1.0.
by Andrew Ganse, Applied Physics Laboratory, University of Washington, Seattle.
Contact at
Copyright (c) 2015, University of Washington, via 3-clause BSD license.
See LICENSE.txt for full license statement.


INVGN: Calculate Tikhonov-regularized, Gauss-Newton nonlinear iterated inversion
to solve the damped nonlinear least squares problem:
minimize ||g(m)-d||^2_2 + lambda^2||Lm||^2_2
For appropriate choices of regularization parameter lambda, this problem is
equivalent to:
minimize ||Lm||_2 subject to ||g(m)-d||_2 <= delta, where delta is some statistically-determined noise threshold
and also to:
minimize ||g(m)-d||_2 subject to ||Lm||_2 <= epsilon, where epsilon is some model-norm threshold
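
As a minimal sketch of the damped objective stated above, the following evaluates
||g(m)-d||^2_2 + lambda^2||Lm||^2_2 for a toy problem. The forward function g, the
data d, the operator L and the value of lambda are all made up for illustration
and are not taken from INVGN itself.

```matlab
% Toy setup (all names and values here are illustrative assumptions)
g = @(m) [m(1)^2 + m(2); m(1)*m(2); m(2)^2];  % hypothetical nonlinear forward function
d = [2; 1; 1];                                % hypothetical measured data
L = [1 -1];                                   % first-difference operator for a 2-param model
lambda = 0.5;                                 % regularization tradeoff parameter

% Damped nonlinear least squares objective from the header above
objective = @(m) norm(g(m) - d)^2 + lambda^2 * norm(L*m)^2;

m_exact = [1; 1];       % fits d exactly and has zero first difference
m_other = [1.2; 0.8];   % misfits d and is penalized by the L term as well
phi_exact = objective(m_exact);
phi_other = objective(m_other);
```

Here phi_exact is zero because both terms vanish at m_exact, while phi_other picks
up a misfit contribution and a model-norm contribution.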

The function call is set up to allow use on both nonlinear and linear problems,
both regularized (inverse) and non-regularized (parameter estimation) problems,
and both frequentist and Bayesian problems. See EXAMPLES below for these calls.
The damped NLS regularization is accomplished with the L-curve method (see e.g.
Hansen 1987).

Code based on material from:

* Professor Ken Creager's ESS-523 inverse theory class, Univ of Washington, 2005
* RC Aster, B Borchers, & CH Thurber, "Parameter Estimation and Inverse
Problems," Elsevier Academic Press, 2004.
* A Tarantola, "Inverse Problem Theory and Methods for Model Parameter
Estimation", Society for Industrial and Applied Mathematics, 2005.
* PC Hansen, "Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects
of Linear Inversion," Society for Industrial Mathematics, 1987.
* PE Gill, W Murray, & MH Wright, "Practical Optimization," Academic Press, 1981.

INVGN is compatible with Octave (version >3), but to handle a few minor
remaining differences between Matlab and Octave, do note the "usingOctave"
flag among the options listed at top of the InvGN.m script. Future versions of
INVGN may allow these options to be set via keyword/value pairs in the
input argument list, but this is not implemented yet.

[m,normdm,normmisfit,normrough,lambda,dpred,G,C,R,Ddiag,i_conv] = ...

INPUTS: (note all vectors are always column vectors)
fwdfunc = Matlab/Octave function handle to forward problem function which
is of form outvector=myfunction(invector,...). (See example 1
of "help function_handle" in Matlab for how to implement)
derivfunc = Matlab/Octave function handle to derivatives function which
is of form outvector=myfunction(invector,...).
If derivfunc=[] then finitediffs are automatically used instead.
epsr = relative accuracy of forward problem computation (not the
accuracy of the modeling of the physical process, but of the
computation itself, see e.g. Gill,Murray,Wright 1981). Affects
convergence testing and is passed into finite diff function (if
used), in both cases providing stepsize h (h is about sqrt(epsr)
if the model params are all of order unity). Epsr has a minimum
at the tradeoff between degrading Taylor series expansion accuracy
if h too large, and increasing roundoff error in the numerator
G(m+h)-G(m) of finite diffs if h too small.
Epsr can be scalar (ie same for all elements of model vector) or
vector of same length as model vector for 1-to-1 correspondence
to different variable "types" (eg layer bulk properties vs layer
dmeas = measured data vector (includes noise)
covd = if sqr matrix, then this is measurement noise covariance matrix.
if col vector, then this is diag of meas noise covariance matrix.
if neg, then = -covd^(-1/2), allowing e.g. use of zeros for a mask,
so note in that case it's the inverse of the STDEVS (sqrt(cov)).
lb = lower model bounds vector (hard bnds, no tolerance)
ub = upper model bounds vector (hard bnds, no tolerance)
These bounds are only implemented at the iteration steps after
solving for the model perturbation without them; ie functions
such as BVLS or LSQNONNEG are not used. If a candidate model
perturbation would pass the bounds, the stepsize of
the perturbation is reduced (but retaining its direction) until
within bounds. As long as we know the final solution must be
within the bounds and not on them, this is merely like changing
the starting point of inversion and is valid (though perhaps sometimes
not as efficient as one of those more complicated functions).
But if the solution from this code ends up on a bound, note you
should treat that solution with some caution and not consider it
a true GN solution. In my own inverse problems using this code
the bounds only ever engage at the tails of my Lcurve where I'm
not too worried about being rigorous, so I haven't found it
necessary to complicate things with those other functions.
m0 = initial model vector estimate to start iterating from; if this
is instead a matrix (NxM), then it's a collection of M initial
model vectors of length N each, corresponding to M lambda values
(Typically this is done to continue iterations on previous runs
without starting over.)
Alternate use: if L (next input arg) = I (identity matrix), then
this is also the "preferred model" whose distance from the soln
model is minimized.
L = If this is a matrix, then this is the user-supplied finite
difference operator for Tikhonov regularization (function
finiteDiffOP.m can help construct it) in the frequentist
framework. If in the Bayesian framework and lambda is set to 1,
then L can be supplied as the Cholesky decomposition of the
inverse model prior covariance matrix.
If L is scalar then it is the number of profile segments (for
example if the model vector represents concatenated P & S wave
speed profiles then Lin=2) and 2nd order Tikhonov (smoothness)
frequentist regularization will be used.
lambda = Used for the frequentist inversion framework (if using Bayesian
framework then set lambda=1 and see Bayesian note for L above).
If lambda is a negative scalar, auto calculate Lcurve with num
pts equal to magnitude of the negative scalar. The auto-calc'd
curve is built around lambda_mid = sqrt(max(eig(G'*G))/max(eig(L'*L)))
and by default sweeps logarithmically from 3 orders of magnitude
more to 3 orders less than lambda_mid. If lambda >=0, use
its value(s) directly as frequentist Tikhonov tradeoff param(s),
in which case lambda can be scalar or vector.
If lambda is a 3-element vector, lambda(1) is the pos or neg scalar
just described; and lambda(2:3) are how many orders of magnitude
greater (lambda(2)) or less (lambda(3)) than lambda_mid to sweep
in the auto-calc'd curve. (lambda(2:3) defaults are 3,-3)
useMpref = flag for one of three problem frameworks:
useMpref=0 is for a frequentist formulation with higher-order
Tikhonov regularization. Here the regularization doesn't
penalize distance from the preferred (initial) model m0, but
there is no restriction on L (unlike useMpref=1).
useMpref=1 is for using 0th-order-Tikh (ridge regr), ie
L==eye(size(L)), in a frequentist formulation in which m0 is
the "preferred model" the inversion minimizes distance to.
useMpref=2 is for using Bayesian formulation, in which case:
m0 is the prior model mean, lambda must =1, L=chol(inv(Cprior))
(user must set that L externally, note ironically L there must
be RIGHT triangular, the "L" notation is from frequentist case),
output C is now the Bayesian posterior covariance matrix (which
note is different from the frequentist C), outputs R and Ddiag
are empty, and normrough won't have literal meaning anymore.
maxiters = Number of GN iterations to perform at each lambda on the Lcurve.
So total number of derivative estimations will be
verbose = 0-3, with 0 = run quietly, and 3 = lots of progress text
Progress text (percent done) in findiffs is useful for slow
forward problems and/or debugging - but note percent done text
is unavailable in jacfwddist and jaccendist.
fid = File handle to which progress text is output in verbose option.
Note fid=1 outputs to stdout, but an output file is useful for
long batch runs when the forward problem is slow.
varargin = the extra arguments (if any) that must be passed to the fwdprob
function are tacked onto the end here. You may have none, in
which case your last arg is the fid.
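
Three of the input conventions above can be sketched in a few lines each. This is
only an illustration under stated assumptions, not the code inside InvGN.m: the
forward function fwd, the scaling max(abs(m),1) of the stepsize, the halving rule
for the step reduction, and the example Cprior are all hypothetical choices.

```matlab
% (1) Forward-difference Jacobian with stepsize h of about sqrt(epsr)
fwd  = @(m) [m(1)^2; m(1)*m(2); exp(m(2))];  % hypothetical forward function
m    = [1; 0.5];
epsr = 1e-12;                                % assumed relative accuracy of fwd
h    = sqrt(epsr) * max(abs(m), 1);          % per-parameter step (scaling is an assumption)
f0   = fwd(m);
J    = zeros(numel(f0), numel(m));
for k = 1:numel(m)
    mk     = m;
    mk(k)  = mk(k) + h(k);                   % perturb one model parameter at a time
    J(:,k) = (fwd(mk) - f0) / h(k);          % numerator G(m+h)-G(m) over h
end
J_exact = [2*m(1) 0; m(2) m(1); 0 exp(m(2))];  % analytic Jacobian for comparison

% (2) Hard bounds: shorten the step, keeping its direction, until within bounds
lb = [0; 0];
ub = [2; 1];
dm = [1.5; 0.8];                             % candidate perturbation that overshoots ub
alpha = 1;
while any(m + alpha*dm < lb) || any(m + alpha*dm > ub)
    alpha = alpha / 2;                       % halving is an assumption; source only says "reduced"
end
m_new = m + alpha*dm;

% (3) Bayesian case (useMpref=2): L from the prior covariance
Cprior = [4 1; 1 3];                         % hypothetical prior model covariance
L = chol(inv(Cprior));                       % upper ("right") triangular, L'*L = inv(Cprior)
```

With these values the forward-difference J agrees with J_exact to roughly 1e-6,
the step is accepted at alpha = 0.5, and L is upper triangular as the useMpref=2
note requires.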

OUTPUTS: (note all vectors are always column vectors)
m = matrix of maximum likelihood model solution vectors for each
lambda. So m is MxP for M model params and P points on Lcurve,
ie each column of m is a model vector for a different lambda.
normdm = vectors of L2 norms of model perturbation steps "dm" for each
lambda. (so normdm is imax x P, with i steps to convergence
for l'th lambda, imax=max(i(l)), and P points on Lcurve). Unused
elements in normdm are set to -1, since norms can't be negative.
normmisfit = vector of L2 norms of data misfit between measured and
predicted data at solution point for each lambda.
normrough = vector of L2 norms of roughness (for which units are 2nd deriv)
at solution points for each lambda. "Roughness" refers to case
of 2nd order Tikhonov (smoothing constraint) but note really here
this norm's meaning depends on what the supplied L matrix is
(eg might be other model norms than roughness). And in Bayesian
case (useMpref=2) this doesn't have same meaning anymore although
it's still part of the objective function as it relates to Cprior.
lambda = vector of tradeoff parameters used for Tikhonov regularization.
(you might already have this if you passed in vector of lambdas
as one of the input arguments, but this is useful in the case of
specifying to auto-calculate N Lcurve points).
dpred = predicted data at solution models (so there are as many vectors
in dpred as there are points on the Lcurve, so dpred is NxP, for
N data pts and P points on Lcurve).
G = Jacobian matrices of partial derivatives at solution models
(so there are as many jacobian matrices in G as there are points
on the Lcurve, so G is NxMxP, for N data pts, M model params,
and P points on Lcurve).
C = if useMpref<2, C contains the frequentist covariance matrices of
solution models (so there are as many cov matrices in C as there
are points on the Lcurve, so C is MxMxP, for M model params and
P points on Lcurve). If useMpref==2 (flagging Bayesian case),
then C is instead the Bayesian posterior model covariance matrix.
R = resolution matrices of solution models (so there are as many
resolution matrices in R as there are points on the Lcurve, so R is
MxMxP, for M model params and P points on Lcurve).
(R is empty for Bayesian case when useMpref=2.)
Ddiag = column vector of diagonal of data res matrix (length Ndata)
(Ddiag is empty for Bayesian case when useMpref=2.)
i_conv = column vector of number of iterations before convergence at each
lambda. If it didn't converge, the i_conv value for that lambda
is 0. (length P, one entry per lambda).
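
The array shapes described above can be unpacked as in the following sketch. The
arrays here are synthetic stand-ins with the documented layout (M=2 model params,
P=3 lambdas), not real INVGN output, and reading sqrt(diag(C)) as 1-sigma model
uncertainties is the usual interpretation rather than something this header states.

```matlab
% Synthetic stand-ins shaped like the documented outputs (assumed values)
normdm = [0.9  0.8  0.7;
          0.1  0.05 0.2;
          1e-4 -1   0.01];                              % imax x P, with -1 marking unused slots
C = cat(3, [4 0; 0 1], [1 0; 0 0.25], [0.04 0; 0 0.01]);  % M x M x P

k = 2;                                   % pick the k'th point on the Lcurve
steps_k = normdm(normdm(:,k) >= 0, k);   % perturbation norms for lambda k, padding dropped
nsteps  = sum(normdm >= 0, 1);           % number of stored steps for each lambda
sigma_k = sqrt(diag(C(:,:,k)));          % sqrt of covariance diagonal at lambda k
```

The same third-index slicing applies to G (NxMxP) and R (MxMxP), and the Lcurve
itself can be viewed with loglog(normmisfit,normrough).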

For more clarity in the code, this function computes its model perturbations via
(regularized) normal equations, rather than generalized SVD or subiterations of
gradient descent, which may be more efficient and/or scalable. The assumption
in this choice was that forward function code takes much longer to compute than
the model perturbation solution itself (which can otherwise be computed faster
via GSVD). If that assumption is not true, then much improvement in compute
time of this function would result from solving for the perturbation using GSVD
methods such as those provided in Per Christian Hansen's toolkit (however note
that toolkit would take a lot of tweaking/paring to make it Octave compatible):
Aside from computational efficiency/speed, another important consequence of
solving normal equations in this code is that it constrains the size of the
problem (number of model parameters) relative to amount of computer memory due
to calculation of a matrix inverse. For reference, I have successfully run
this code for over 10,000 model parameters on a computer with 16GB of RAM, and
for over 2500 model parameters on a computer with 8GB RAM (in both those cases
the matrix inverse was the memory constraint rather than forward problem).
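
A minimal sketch of a Gauss-Newton iteration using regularized normal equations,
as described above, is given here for a toy two-parameter problem. It is not the
code inside InvGN.m: the forward function g, its Jacobian jac, the operator L and
lambda are made up, data weighting by covd is left out, and the system is solved
with backslash rather than an explicit matrix inverse.

```matlab
% Toy mildly nonlinear problem (all names and values are illustrative assumptions)
g     = @(m) [m(1) + 0.1*m(2)^2; m(2) + 0.1*m(1)^2; m(1) + m(2)];
jac   = @(m) [1, 0.2*m(2); 0.2*m(1), 1; 1, 1];   % analytic Jacobian of g
mtrue = [1; 2];
d     = g(mtrue);                                % noise-free synthetic data
L     = [1 -1];                                  % first-difference operator
lambda = 0.1;

m = [0; 0];                                      % initial model
for iter = 1:20
    G  = jac(m);
    r  = d - g(m);                               % data residual at current model
    A  = G'*G + lambda^2*(L'*L);                 % regularized normal-equations matrix
    b  = G'*r - lambda^2*(L'*(L*m));             % right-hand side for the perturbation
    dm = A \ b;                                  % model perturbation
    m  = m + dm;
    if norm(dm) < 1e-12, break; end              % simple convergence test (assumption)
end

% Gradient of the damped objective (up to a factor of 2); near zero at the solution
grad = jac(m)'*(g(m) - d) + lambda^2*(L'*(L*m));
```

The converged m sits close to mtrue but not on it, since the lambda term pulls
the two parameters toward each other.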

EXAMPLES: (leaving off output part "[var1,var2,etc]=", which is the same format for all)

Regularized frequentist inversion with auto-chosen lambdas (numlambdas of them):

Regularized frequentist inversion with prechosen lambdas:
[1.04193e+07; 10419.3; 10.4193; 0.0104193;],0,maxiters,3,1);

Regularized inversion, picking up after previous run (using lambda vector from prev run):

Parameter estimation (no regularization):

Frequentist linear problem (should converge after 1st iter), with smoothing regularization:
deriv=@(x) A; fwdp=@(x) A*x; % define funcs using the A matrix inline
(Note maxiters=2 there: actually the first "iter" in saved results is for the
initial value, which admittedly isn't relevant for purely linear problems, but
that's how this nonlinear script fits purely linear problems. Yes definitely
an inefficient kludge.)
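
Why a purely linear problem converges after one iteration can be checked without
calling INVGN at all. The sketch below applies one regularized normal-equations
step by hand, using the same inline deriv/fwdp definitions as above. The matrix A,
data d, operator L, lambda and starting model m0 are made-up values, and data
weighting is omitted.

```matlab
A = [1 2; 3 4; 5 6];                 % hypothetical linear forward operator
d = [1; 2; 3];                       % hypothetical data
L = [1 -1];                          % first-difference operator
lambda = 0.3;
deriv = @(x) A;  fwdp = @(x) A*x;    % define funcs using the A matrix inline

m0 = [10; -7];                       % arbitrary starting model
H  = A'*A + lambda^2*(L'*L);         % same matrix at every iteration since A is constant
dm = H \ (deriv(m0)'*(d - fwdp(m0)) - lambda^2*(L'*(L*m0)));
m1 = m0 + dm;                        % model after the first step

m_direct = H \ (A'*d);               % direct Tikhonov solution of the linear problem
dm2 = H \ (deriv(m1)'*(d - fwdp(m1)) - lambda^2*(L'*(L*m1)));  % second step
```

The first step lands on m_direct regardless of m0, and the second step dm2 is
zero to rounding error.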

Frequentist nonlinear problem with preferred model:

Bayesian nonlinear problem: