I am not one of the elite. I had to drop out because life was not simple, so I think I can offer an outside perspective. I did teach myself a lot of math to be able to understand machine learning. I did not read ahead - that is boring. But I found that as I went along, the more I wanted to tune an algorithm beyond the standard fare, the more power I wanted, and so the more math I had to read and understand. I simply was not satisfied to write decision tree code without understanding how it works. And if you read a paper and want to implement something, you need to be able to understand the math hiding the ideas, because the pseudocode is often rubbish and bugged - assuming it's even there at all. I did this to implement matrix factorization, to extend logistic regression beyond binary classification and to sparser weights, to implement SVMs, and so on.
And then, the more I learned, something began to happen. I no longer thought "I need naive Bayes for this, a decision tree for that, a random forest over there" or whatever. I thought "I need this concept from statistics or that idea from information theory; I just need to group and count there, and that loss function is useful here." So I could come up with something, or modify something, to fit my need. As I go along I am finding that while I used to look for an excuse to use something fancy sounding, I now prefer to go as simple as possible - but without having gone through the hard stage I could not appreciate where the simpler solution is better.
I also learned a great deal of differential calculus when implementing an automatic differentiator (a backpropagating neural net is basically just a special case of reverse-mode auto diff). It's fast, works with decent-sized vectors (I tested 10^5 - 10^6 entries) and can compute gradients, Hessians and Jacobians of arbitrary functions. I also expect I can easily extend it to work with tensors, although I haven't needed them yet. Using it I wrote a stochastic gradient descent algorithm into which I can plug arbitrary loss functions, and a whole bunch of algorithms just merge into one. I could also easily write, say, L-BFGS on top of it. Neural networks, logistic regression, linear regression and support vector machines were basically just a matter of swapping out one line.
This flexibility is what you gain.
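To make the reverse-mode point concrete, here is a tiny illustrative sketch on a toy expression type. This is not my actual implementation (which works on float arrays); the type and function names below are made up just for the example. One forward pass evaluates the function, one reverse pass pushes an adjoint back through every operation, and the whole gradient falls out in a single sweep - which is all backpropagation really is.

// Minimal illustration of reverse-mode autodiff on a toy expression type.
type Expr =
    | Const of float
    | Var of string
    | Add of Expr * Expr
    | Mul of Expr * Expr
    | Exp of Expr

// Forward pass: evaluate the expression under the given variable bindings.
let rec eval (env: Map<string, float>) expr =
    match expr with
    | Const c -> c
    | Var v -> env.[v]
    | Add (a, b) -> eval env a + eval env b
    | Mul (a, b) -> eval env a * eval env b
    | Exp a -> exp (eval env a)

// Reverse pass: push the adjoint (d output / d this node) toward the leaves,
// accumulating d output / d variable for every variable in one sweep.
let rec reverse env adjoint expr (grads: Map<string, float>) =
    match expr with
    | Const _ -> grads
    | Var v ->
        let g = defaultArg (Map.tryFind v grads) 0.
        Map.add v (g + adjoint) grads
    | Add (a, b) ->
        grads |> reverse env adjoint a |> reverse env adjoint b
    | Mul (a, b) ->
        grads
        |> reverse env (adjoint * eval env b) a
        |> reverse env (adjoint * eval env a) b
    | Exp a -> reverse env (adjoint * exp (eval env a)) a grads

// Gradient of an expression with respect to all of its variables.
let gradOf env expr = reverse env 1. expr Map.empty

// e.g. gradOf (Map.ofList ["x", 2.; "y", 3.]) (Mul (Var "x", Exp (Var "y")))
// gives d/dx = e^3 and d/dy = 2 * e^3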
===========================
In the code below, fn is an arbitrary mathematical function, such as
let eq1 (x:float[]) = 1. - 4. * x.[0] + 2. * x.[0] ** 2. - 2. * x.[1] ** 3.
open System

// Newton's method: keep stepping to cguess - H^-1 * g until the step size falls
// below prec or we run out of iterations. hessian and grad_ are provided by the
// automatic differentiator (grad_ returns a tuple whose middle element is the
// gradient); the vector/matrix operations come from the underlying linear
// algebra types.
let newtonOpt prec iters fn (guess:float[]) =
    let rec iterate delta iter cguess =
        match delta with
        | _ when delta < prec || iter > iters -> cguess
        | _ ->
            let h = hessian cguess fn            // Hessian at the current guess
            let _, g, _ = grad_ cguess fn        // gradient at the current guess
            let gs = cguess - h.Inverse() * g    // Newton step
            let cdel = (gs - cguess).Norm(2.)    // how far the step moved us
            iterate cdel (iter + 1) gs
    iterate Double.MaxValue 0 guess
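A call on eq1 would look something like this (the tolerance, iteration cap and starting point here are just placeholder values for illustration):

// hypothetical call: tolerance 1e-6, at most 100 Newton steps, starting from the origin
let xmin = newtonOpt 1e-6 100 eq1 [|0.; 0.|]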
An example of a loss function - the log-likelihood used for logistic regression, where 1./(1. + exp _XdotW) is the predicted probability, _XdotW is the example's linear score and y is its 0/1 label:
[<ReflectedDefinition>]   // makes the function's quotation available at runtime
let llog (cx:float[]) _XdotW y =
    y * log (1. / (1. + exp _XdotW)) + (1. - y) * log (1. - 1. / (1. + exp _XdotW))
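And to show what "swapping out one line" means in practice, here is a stripped-down sketch of the idea. This is not my actual code; the gradient below is a crude finite-difference stand-in for what the real differentiator provides, and the names are only for illustration. The point is that the descent loop never changes - only the loss you hand it does.

// Crude central-difference gradient, standing in for the automatic differentiator.
let numGrad (f: float[] -> float) (w: float[]) =
    let eps = 1e-6
    Array.init w.Length (fun i ->
        let wp = Array.copy w
        let wm = Array.copy w
        wp.[i] <- wp.[i] + eps
        wm.[i] <- wm.[i] - eps
        (f wp - f wm) / (2. * eps))

// Generic gradient descent: the loss, as a function of the weights, is the
// only thing that changes between "different" models.
let descend rate steps (loss: float[] -> float) (w0: float[]) =
    let mutable w = Array.copy w0
    for _ in 1 .. steps do
        let g = numGrad loss w
        w <- Array.map2 (fun wi gi -> wi - rate * gi) w g
    w

// Squared-error loss for a single (x, y) pair -> linear regression.
let squaredLoss (x: float[]) (y: float) (w: float[]) =
    let pred = Array.fold2 (fun acc xi wi -> acc + xi * wi) 0. x w
    (pred - y) ** 2.

// Log loss for a single (x, y) pair -> logistic regression. Swap this in and
// the rest of the loop is untouched.
let logLoss (x: float[]) (y: float) (w: float[]) =
    let score = Array.fold2 (fun acc xi wi -> acc + xi * wi) 0. x w
    let p = 1. / (1. + exp (-score))
    -(y * log p + (1. - y) * log (1. - p))

// e.g. descend 0.1 200 (squaredLoss [|1.; 2.|] 3.) [|0.; 0.|]
//      descend 0.1 200 (logLoss     [|1.; 2.|] 1.) [|0.; 0.|]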