
It'd be something like conjugate gradient, not SGD, unless A itself is too big to fit in memory.


The method John D. Cook is referring to is something like conjugate gradient, not SGD?

I’m not too familiar with CG. What clues you in that he’s thinking more CG than SGD?


It is just "what you do". If it is a small problem, the default is a QR decomposition of A. If you are worried about speed, do a Cholesky decomposition of A'A. If the problem is big (usually because A is sparse), then you do conjugate gradient, because fill-in will bite you with a direct method. If it is really, really big (A can't fit in memory), then it isn't clear what the "thing to do" is. It is probably "sketching", but in ML/neural-network land everyone just does SGD, which you can think of as a Monte Carlo estimate of the gradient (which, for a linear least-squares problem, is A'(Ax - b)). Maybe sketching and SGD are equivalent (or one approximates the other). "What you do" is based on convergence and stability characteristics.
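A minimal sketch (not from the thread) of the four options above applied to the same least-squares problem min ||Ax - b||^2: QR on A, Cholesky on A'A, conjugate gradient on the normal equations, and plain row-by-row SGD. Problem size, step size, and epoch count are illustrative assumptions.

    # Assumed illustrative comparison of the solvers mentioned above.
    import numpy as np
    from scipy.sparse.linalg import cg

    rng = np.random.default_rng(0)
    m, n = 2000, 50
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    # 1. QR decomposition of A (the small-problem default).
    Q, R = np.linalg.qr(A)
    x_qr = np.linalg.solve(R, Q.T @ b)

    # 2. Cholesky of A'A (cheaper, but squares the condition number).
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.cholesky(AtA)
    x_chol = np.linalg.solve(L.T, np.linalg.solve(L, Atb))

    # 3. Conjugate gradient on the normal equations A'A x = A'b.
    #    Only needs matrix-vector products, so a sparse A stays sparse.
    x_cg, info = cg(AtA, Atb)

    # 4. SGD: each step uses a single row of A, i.e. a Monte Carlo
    #    estimate of the full gradient A'(Ax - b).
    x_sgd = np.zeros(n)
    step = 1e-3                      # assumed step size
    for epoch in range(50):          # assumed epoch count
        for i in rng.permutation(m):
            g = (A[i] @ x_sgd - b[i]) * A[i]   # gradient from one row
            x_sgd -= step * g

    for name, x in [("qr", x_qr), ("cholesky", x_chol), ("cg", x_cg), ("sgd", x_sgd)]:
        print(name, np.linalg.norm(A @ x - b))

The direct methods and CG land on essentially the same residual; the SGD iterate only gets close, which is the trade you accept when A is too big to factor or even to hold in memory.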


Helpful. It makes sense that CG maintains sparsity. I didn’t realize it saw use in practice.




