This is really a nice thing to put out there. I want to add that AD (automatic differentiation) can be a great jumping-off point to other things. (It sure was for me in my studies, anyway.) Here is the path I followed (though I am sure there are many paths; I am too dumb and tired to dream up another right now. Perhaps that is why I am so eager to put this all out here in a "gushy" way -- my apologies):
First I implemented simple forward-mode AD in Python (with NumPy), handling gradients and Hessians so you could apply AD to find extrema of constrained variational problems (to design constrained B-spline geometry, for example, or to construct equations of motion). Then I extended it to handle interval analysis, which introduces a tiny bit of topology (fixed-point theorems and such -- provably finding all solutions to a nonlinear system over some domain).
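(For anyone who wants to try that first step, here is a minimal forward-mode sketch in the same spirit; the `Dual` class, the `sin` wrapper, and the `derivative` helper are names I made up for illustration, not the actual code described above.)

import numpy as np

class Dual:
    """Toy forward-mode AD value (illustration only): carries f(x) and f'(x) together."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (f*g)' = f'*g + f*g'
        return Dual(self.val * other.val, self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def sin(x):
    # elementary rule: d/dx sin(x) = cos(x), chained through x.der
    return Dual(np.sin(x.val), np.cos(x.val) * x.der)

def derivative(f, x0):
    return f(Dual(x0, 1.0)).der  # seed dx/dx = 1 and read off the derivative

print(derivative(lambda x: x * x + sin(x), 0.5))  # 2*0.5 + cos(0.5) ~= 1.8776

The same pattern extends to gradients and Hessians by carrying a vector of partials (or nested duals) instead of a single derivative slot.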
Then pick up declarative programming (unification) and revisit overloading to build up a mathematical expression tree for anything you have implemented. (In detail, overload the math operators such that, for an object which is a mathematical combination of two other objects, the new object stores some representation of the two original objects, plus the operation used to combine them. Remember to track your automatically generated connectives! All expressions are ternary -- n-ary stuff gets broken out into ternary stuff.)
Then, starting at the final "node" of this mathematical tree, walk back down the tree to build whatever you want (reverse-mode AD can be done this way, and is a good thing to try). I used it to automatically compile math down into logical rules for constraint programming. Much fun!
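(Here is a rough Python sketch of that overload-then-walk-back idea, since Python is what the rest of this thread uses; `Node` and `backward` are invented names, and a real system would add many more operations and local-derivative rules.)

class Node:
    """Toy expression-tree node built by operator overloading (a tiny 'tape')."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_derivative) pairs
        self.grad = 0.0
    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Node(self.value * other.value, [(self, other.value), (other, self.value)])

def backward(out):
    """Reverse-mode AD: order the tree topologically, then walk it back from the output."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):  # consumers are processed before the nodes they consume
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Node(2.0), Node(3.0)
z = x * y + x
backward(z)
print(x.grad, y.grad)  # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0

The same tree walk can just as well emit constraint-programming rules or any other target instead of gradients.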
Now re-do it all in C++, but use overloading to build a math library to support everything, or use Eigen or some other expression-template library. (I'm not done with this.)
Ah, expression-template libraries are another instance of "reverse-mode style" expression-tree computation, a.k.a. the "interpreter pattern". Good fun here, but more involved, I think.
In the author's packages, integers which label dimensions etc. have gradient `nothing`, but arrays which happen to contain integers do not signal anything:
I don't really understand this comment. The rules of forward-mode automatic differentiation are exactly the rules of symbolic differentiation one would learn in high school, except that AD is evaluated at a particular point (which avoids the need to build an expression tree, and with it the exponential blow-up in the size of that tree).
AD is also on fairly shaky ground once you step outside what can be expressed in standard analytical calculus. For example it is fairly easy to express the same function in different ways and get derivatives that differ. A lot of work on semantics is needed.
I posted an example. y = cuberoot(x^3) and y = x are the same function, but they will give you different derivatives at x = 0 in, I'm guessing, all the autodiff implementations you can get your hands on.
Thank you, this is interesting. I just tried. You need L'Hospital's Rule.
I'd realized this was an issue with forward-mode autodiff, but hadn't thought about reverse mode enough to realize it comes up there too.
In general, NaN is a placeholder for "all the derivatives you didn't know you needed in advance". I think the Dual Numbers shed light on this (see "Division": [https://en.wikipedia.org/wiki/Dual_number#Division]).
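(To make that concrete, here is a toy version of the division rule from that Wikipedia section; this `Dual` class is just something written for this comment, not any library's implementation.)

class Dual:
    """Toy dual number a + b*eps with eps**2 == 0 (not any library's class)."""
    def __init__(self, a, b):
        self.a, self.b = a, b
    def __truediv__(self, other):
        # Division rule: (a + b*eps) / (c + d*eps) = a/c + ((b*c - a*d) / c**2) * eps,
        # which is only defined when the real part c is nonzero.
        c, d = other.a, other.b
        return Dual(self.a / c, (self.b * c - self.a * d) / c ** 2)

print(vars(Dual(1.0, 0.0) / Dual(2.0, 1.0)))  # fine: the divisor has a nonzero real part
try:
    Dual(1.0, 0.0) / Dual(0.0, 1.0)  # divisor is a "pure eps" dual
except ZeroDivisionError as err:
    print("undefined, much like the NaNs above:", err)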
One could imagine lazily computing higher derivatives as needed for L'Hospital's Rule in either Forward or Reverse mode, however. For Forward Mode, I guess you'd replace your eager "Dual Number" pairs with lazy lists?
For reverse mode it seems you could do extra "passes" on the tape as needed... People who do this stuff for a living must know(?), but a quick Google doesn't turn up an answer for me...
Actually I'm wrong about L'Hospital's Rule in this problem; it solves nothing here. I would insert an "[EDIT ...]" into my original post to avoid misleading people, but it's too late to edit.
It doesn't matter how many times you differentiate x^(1/3); you cannot evaluate the result at zero. Which is funny, because you'd think you could differentiate x^3 and use the inverse function theorem. But there's that caveat about it not working when the derivative is zero.
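(A quick symbolic check of that claim, using SymPy rather than any autodiff library; the variable names are mine.)

import sympy as sp

# symbolic check: every derivative of x^(1/3) carries a negative power of x
x = sp.symbols('x')
f = x ** sp.Rational(1, 3)
for n in range(1, 4):
    dn = sp.diff(f, x, n)        # n-th derivative of x^(1/3)
    print(n, dn, dn.subs(x, 0))  # substituting x = 0 blows up every time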
I am now revisiting high school calculus and doubting my own sanity. Will post updates if/when I get my head right.
>>> import autograd.numpy as np
>>> from autograd import grad
>>> def fn(x):
... return np.power(np.power(x, 3), 1/3)
...
>>> gradfn = grad(fn)
>>> gradfn(0.0)
/usr/local/lib/python3.6/dist-packages/autograd/numpy/numpy_vjps.py:59: RuntimeWarning: divide by zero encountered in double_scalars
lambda ans, x, y : unbroadcast_f(x, lambda g: g * y * x ** anp.where(y, y - 1, 1.)),
/usr/local/lib/python3.6/dist-packages/autograd/numpy/numpy_vjps.py:59: RuntimeWarning: invalid value encountered in double_scalars
lambda ans, x, y : unbroadcast_f(x, lambda g: g * y * x ** anp.where(y, y - 1, 1.)),
nan
… Damn.
>>> import torch
>>> x = torch.tensor(0.0, requires_grad=True)
>>> y = ((x**3) ** (1/3))
>>> y.backward()
>>> x.grad
tensor(nan)
Except that by the definition of dual numbers, eps^2 is 0, as is eps^3. So the slope you computed is 0, which is rather wrong for (a complicated representation of) y = x.
> I don't see a problem with simplifying the exponents first.
Sure, we call that computer algebra. As soon as you start doing that, you aren't doing automatic differentiation, you are doing (at least in part) symbolic differentiation.
It's good to question what we teach. For example, I never learned any method for computing a square root to arbitrary precision, whereas my parents did.
But autodiff doesn't replace symbolic differentiation. Consider the derivative of cuberoot(x^3). Autodiff struggles at x = 0 because x^3 is flat there (slope 0) and cuberoot is vertical there (slope infinity), and chaining the two together gives you slope 0 * inf = NaN.
If you wanted to try to fix that problem within autodiff -- well, first off, it's probably not worth it, because similar problems are inherent to autodiff -- you would need an analytic understanding of what's going on.
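(For what it's worth, you can reproduce that 0 * inf story by hand from the chain rule's two factors; a tiny sketch, using NumPy only so that raising 0.0 to a negative power returns inf with a warning instead of raising an exception. The numbers here are worked out by hand, not pulled from any library's internals.)

import numpy as np

x = 0.0
d_inner = 3 * x ** 2                          # derivative of the inner x^3 at 0 -> 0.0
d_outer = (1 / 3) * np.power(x ** 3, -2 / 3)  # derivative of cuberoot at u = 0 -> inf (divide-by-zero warning)
print(d_outer * d_inner)                      # inf * 0.0 -> nan, the same NaN autograd and torch print above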
It's true that most autodiff systems would behave this way, but a human given only the rules of calculus (the same rules as AD) would have trouble at the same point. Of course, most mathematicians would apply additional insights while differentiating here (such as cancellation of terms or L'Hôpital's rule). But equally, you could add these things to an AD system too; they are orthogonal to the basic derivative transformation, so it's really a difference of degree at best.
You will have to put a bit of work into thinking about it, if you don't already see it. At the risk of being more rude than effective, I spelled it out pretty clearly above if you understand autodiff, which requires an understanding of the chain rule, which most people first learn about in a calculus class.
If you want to claim that autodiff is a replacement for all our differentiation needs, then you should try to understand when it doesn't work.
The derivative of a fractional power (with the fraction less than 1) involves a negative fractional power. 0 raised to a negative power involves dividing by zero. This gives you some kind of NaN in, e.g., vanilla languages such as Python. The "issue with automatic differentiation" is really an issue with dividing by zero. Graph x^(1/3): its slope is vertical at zero.
My mistake, then. I did not intend to be so precise; I was thinking loosely. I have a very simple hand-rolled autodiff tool (in Python) here to play with. I raise 0 to a negative exponent in the gradient of the cube root at 0. I do not get a NaN. Strictly speaking, I get this:
"ZeroDivisionError: 0.0 cannot be raised to a negative power"
Is that satisfying? I do not claim anything more specific here. Sorry to be imprecise.
Numerical methods can be efficient, but nothing beats an analytic method that can simply be evaluated once derived. Often you can go from an approximate answer requiring numerous function evaluations to an exact answer in one simpler function evaluation. Pathological cases notwithstanding.
Just because automatic differentiation is a different technique than numerical differentiation doesn't mean it isn't a numerical method.
(I probably should not say this, but I will anyway: your comments regarding math and science always give me the distinct impression that you know the name of a lot of things without knowing much about the thing. You are frequently breathless about analog quantum computing, Lie algebras, Chu spaces, some universal concept of duality, but then you fail to respond with substantial arguments to people that try to temper the breathlessness. If you want to convince people of things, you should work on this.)
> substantial arguments to people that try to temper the breathlessness
Which arguments in particular?
> Your comments regarding math and science always give me the distinct impression that you know the name of a lot of things without knowing much about the thing.
OK, be more specific: which ones do you care about? Maybe I don't always have time to answer everyone.
> some universal concept of duality
It's the concept of adjointness. And yes, it's very universal.
Huh, interesting. First time I read this comment, it said something like "why should I care what you think?" I guess the edit indicates that you do care?
It is still important to know, even if most division problems you encounter are handed off to a calculator or computer. Why? Because understanding long division is required background for understanding polynomial division. And understanding that is required to understand how you can get all the roots of a polynomial using Newton's method. And that is required to figure out how many different plasma modes there are. And THAT is a problem no computer can simply answer for you. Of course, only a small number of people care, and we don't teach about plasma modes in school, but we teach the basics that you need to start your way toward more complicated things. Long division of polynomials is also used in the construction of CRC checksums.
So no, understanding long division is not meaningless. It gives you a starting point for a lot of other things.
Learning to do calculus allows you to do it on paper and in your head. Lines of code are liabilities that need to be maintained, understood concepts are assets that let you build, maintain, and simplify.
Well technically, over differentiable functions, but yes. Derivatives can be evaluated by decomposing functions, taking elementary derivatives, and recomposing the functions, in a way that integrals can't be.
Automatic differentiation only works on functions which are locally complex analytic. It fails on things that fall outside that model, but are still differentiable. Daubechies wavelets are a good example of this.
Julia's systems can make use of ChainRules.jl, which doesn't assume functions are locally complex analytic; instead it uses Wirtinger derivatives (df/dz and df/dzbar), and when a function is complex analytic it just uses the well-known simplification df/dzbar = 0. This allows non-analytic functions to be used; you just need to supply all primitives in Wirtinger form. Julia has enough non-ML people using this stuff that complex numbers actually get used here :).
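(In case the Wirtinger formulation is unfamiliar, here is a tiny numerical illustration in Python, since that is what the rest of the thread uses. It only demonstrates the textbook definition, nothing about ChainRules.jl's internals, and the `wirtinger` helper is an invented name.)

# Numerical check of the Wirtinger definition (not ChainRules.jl):
# df/dz = (df/dx - i*df/dy)/2 and df/dzbar = (df/dx + i*df/dy)/2.
def wirtinger(f, z, h=1e-6):
    dfdx = (f(z + h) - f(z - h)) / (2 * h)            # partial along the real axis
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # partial along the imaginary axis
    return (dfdx - 1j * dfdy) / 2, (dfdx + 1j * dfdy) / 2

print(wirtinger(lambda z: z ** 2, 1 + 2j))         # analytic: df/dzbar ~ 0, df/dz ~ 2z
print(wirtinger(lambda z: z.conjugate(), 1 + 2j))  # non-analytic: df/dz ~ 0, df/dzbar ~ 1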
Kids in school say, "Why do I have to learn X? I am never going to need it in real life."
And then you end up with adults who don't believe that the earth is round or that vaccines are low-risk and extremely beneficial, who think that the existence of snow means there is no climate change, and that homeopathy and essential oils are a valid alternative to modern medicine, etc.