This is really a nice thing to put out there. I want to add that AD (automatic differentiation) can be a great jumping-off point to other things. (It sure was for me in my studies, anyway.) Here is the path I followed (though I am sure there are many paths; I am too dumb and tired to dream up another right now. Perhaps that is why I am so eager to put this all out here in a "gushy" way -- my apologies):
First I implemented simple forward-mode AD in Python (with NumPy), handling gradients and Hessians so you could apply AD to find extrema of constrained variational problems (to design constrained B-spline geometry, for example, or to construct equations of motion). Then I extended it to handle interval analysis, which introduces a tiny bit of topology (fixed-point theorems and such -- provably finding all solutions to a nonlinear system over some domain).
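(For anyone who wants to try that first step, here is a minimal forward-mode sketch in the same spirit; the `Dual` class, the `sin` wrapper, and the `derivative` helper are names I made up for illustration, not the actual code described above.)

import numpy as np

class Dual:
    """Toy forward-mode AD value (illustration only): carries f(x) and f'(x) together."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (f*g)' = f'*g + f*g'
        return Dual(self.val * other.val, self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def sin(x):
    # elementary rule: d/dx sin(x) = cos(x), chained through x.der
    return Dual(np.sin(x.val), np.cos(x.val) * x.der)

def derivative(f, x0):
    return f(Dual(x0, 1.0)).der  # seed dx/dx = 1 and read off the derivative

print(derivative(lambda x: x * x + sin(x), 0.5))  # 2*0.5 + cos(0.5) ~= 1.8776

The same pattern extends to gradients and Hessians by carrying a vector of partials (or nested duals) instead of a single derivative slot.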
Then pick up declarative programming (unification) and revisit overloading to build up a mathematical expression tree for anything you have implemented. (In detail, overload the math operators such that, for an object which is a mathematical combination of two other objects, the new object stores some representation of the two original objects, plus the operation used to combine them. Remember to track your automatically generated connectives! All expressions are ternary -- n-ary stuff gets broken out into ternary stuff.)
Then, starting at the final "node" of this mathematical tree, walk back down the tree to build whatever you want (reverse-mode AD can be done this way, and is a good thing to try). I used it to automatically compile math down into logical rules for constraint programming. Much fun!
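(Here is a rough Python sketch of that overload-then-walk-back idea, since Python is what the rest of this thread uses; `Node` and `backward` are invented names, and a real system would add many more operations and local-derivative rules.)

class Node:
    """Toy expression-tree node built by operator overloading (a tiny 'tape')."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_derivative) pairs
        self.grad = 0.0
    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Node(self.value * other.value, [(self, other.value), (other, self.value)])

def backward(out):
    """Reverse-mode AD: order the tree topologically, then walk it back from the output."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):  # consumers are processed before the nodes they consume
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Node(2.0), Node(3.0)
z = x * y + x
backward(z)
print(x.grad, y.grad)  # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0

The same tree walk can just as well emit constraint-programming rules or any other target instead of gradients.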
Now re-do it all in C++, but use overloading to build a math library to support everything, or use Eigen or some other expression-template library. (I'm not done with this.)
Ah, expression-template libraries are another instance of "reverse-mode style" expression-tree computation, a.k.a. the "interpreter pattern". Good fun here, but more involved, I think.
In the author's packages, integers which label dimensions etc. have gradient `nothing`, but arrays which happen to contain integers do not signal anything:
I don't really understand this comment. The rules of forward-mode automatic differentiation are exactly the rules of symbolic differentiation one would learn in high school, except that AD is evaluated at a particular point (which avoids the need to build an expression tree, and with it the exponential blow-up in the size of that tree).
AD is also on fairly shaky ground once you step outside what can be expressed in standard analytical calculus. For example it is fairly easy to express the same function in different ways and get derivatives that differ. A lot of work on semantics is needed.
I posted an example. y = cuberoot(x^3) and y = x are the same function, but they will give you different derivatives at x = 0 in, I'm guessing, all the autodiff implementations you can get your hands on.
Thank you, this is interesting. I just tried. You need L'Hospital's Rule.
I'd realized this was an issue with forward-mode autodiff, but hadn't thought about reverse mode enough to realize it comes up there too.
In general, NaN is a placeholder for "all the derivatives you didn't know you needed in advance". I think the Dual Numbers shed light on this (see "Division": [https://en.wikipedia.org/wiki/Dual_number#Division]).
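(To make that concrete, here is a toy version of the division rule from that Wikipedia section; this `Dual` class is just something written for this comment, not any library's implementation.)

class Dual:
    """Toy dual number a + b*eps with eps**2 == 0 (not any library's class)."""
    def __init__(self, a, b):
        self.a, self.b = a, b
    def __truediv__(self, other):
        # Division rule: (a + b*eps) / (c + d*eps) = a/c + ((b*c - a*d) / c**2) * eps,
        # which is only defined when the real part c is nonzero.
        c, d = other.a, other.b
        return Dual(self.a / c, (self.b * c - self.a * d) / c ** 2)

print(vars(Dual(1.0, 0.0) / Dual(2.0, 1.0)))  # fine: the divisor has a nonzero real part
try:
    Dual(1.0, 0.0) / Dual(0.0, 1.0)  # divisor is a "pure eps" dual
except ZeroDivisionError as err:
    print("undefined, much like the NaNs above:", err)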
One could imagine lazily computing higher derivatives as needed for L'Hospital's Rule in either Forward or Reverse mode, however. For Forward Mode, I guess you'd replace your eager "Dual Number" pairs with lazy lists?
For reverse mode it seems you could do extra "passes" on the tape as needed... People who do this stuff for a living must know(?), but a quick Google doesn't turn up an answer for me...
Actually I'm wrong about L'Hospital's Rule in this problem; it solves nothing here. I would insert an "[EDIT ...]" into my original post to avoid misleading people, but it's too late to edit.
It doesn't matter how many times you differentiate x^(1/3); you cannot evaluate the result at zero. Which is funny, because you'd think you could differentiate x^3 and use the inverse function theorem. But there's that caveat about it not working when the derivative is zero.
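(A quick symbolic check of that claim, using SymPy rather than any autodiff library; the variable names are mine.)

import sympy as sp

# symbolic check: every derivative of x^(1/3) carries a negative power of x
x = sp.symbols('x')
f = x ** sp.Rational(1, 3)
for n in range(1, 4):
    dn = sp.diff(f, x, n)        # n-th derivative of x^(1/3)
    print(n, dn, dn.subs(x, 0))  # substituting x = 0 blows up every time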
I am now revisiting high school calculus and doubting my own sanity. Will post updates if/when I get my head right.
>>> import autograd.numpy as np
>>> from autograd import grad
>>> def fn(x):
... return np.power(np.power(x, 3), 1/3)
...
>>> gradfn = grad(fn)
>>> gradfn(0.0)
/usr/local/lib/python3.6/dist-packages/autograd/numpy/numpy_vjps.py:59: RuntimeWarning: divide by zero encountered in double_scalars
lambda ans, x, y : unbroadcast_f(x, lambda g: g * y * x ** anp.where(y, y - 1, 1.)),
/usr/local/lib/python3.6/dist-packages/autograd/numpy/numpy_vjps.py:59: RuntimeWarning: invalid value encountered in double_scalars
lambda ans, x, y : unbroadcast_f(x, lambda g: g * y * x ** anp.where(y, y - 1, 1.)),
nan
… Damn.
>>> import torch
>>> x = torch.tensor(0.0, requires_grad=True)
>>> y = ((x**3) ** (1/3))
>>> y.backward()
>>> x.grad
tensor(nan)
Except that by the definition of dual numbers, eps^2 is 0, as is eps^3. So the slope you computed is 0, which is rather wrong for (a complicated representation of) y = x.
> I don't see a problem with simplifying the exponents first.
Sure, we call that computer algebra. As soon as you start doing that, you aren't doing automatic differentiation, you are doing (at least in part) symbolic differentiation.
It's good to question what we teach. For example, I never learned any method for computing a square root to arbitrary precision, whereas my parents did.
But autodiff doesn't replace symbolic differentiation. Consider the derivative of cuberoot(x^3). Autodiff struggles at x = 0 because x^3 is flat there (slope 0) and cuberoot is vertical there (slope infinity), and chaining the two together gives you slope 0 * inf = NaN.
If you wanted to try to fix that problem within autodiff -- well, first off, it's probably not worth it, because similar problems are inherent to autodiff -- you would need an analytic understanding of what's going on.
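(For what it's worth, you can reproduce that 0 * inf story by hand from the chain rule's two factors; a tiny sketch, using NumPy only so that raising 0.0 to a negative power returns inf with a warning instead of raising an exception. The numbers here are worked out by hand, not pulled from any library's internals.)

import numpy as np

x = 0.0
d_inner = 3 * x ** 2                          # derivative of the inner x^3 at 0 -> 0.0
d_outer = (1 / 3) * np.power(x ** 3, -2 / 3)  # derivative of cuberoot at u = 0 -> inf (divide-by-zero warning)
print(d_outer * d_inner)                      # inf * 0.0 -> nan, the same NaN autograd and torch print above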
It's true that most autodiff systems would behave this way, but a human given only the rules of calculus (the same rules as AD) would have trouble at the same point. Of course, most mathematicians would apply additional insights while differentiating here (such as cancellation of terms or L'Hôpital's rule). But equally, you could add these things to an AD system too; they are orthogonal to the basic derivative transformation, so it's really a difference of degree at best.
You will have to put a bit of work into thinking about it, if you don't already see it. At the risk of being more rude than effective, I spelled it out pretty clearly above if you understand autodiff, which requires an understanding of the chain rule, which most people first learn about in a calculus class.
If you want to claim that autodiff is a replacement for all our differentiation needs, then you should try to understand when it doesn't work.
The derivative of a fractional power (with the fraction less than 1) involves a negative fractional power. 0 raised to a negative power involves dividing by zero. This gives you some kind of NaN in, e.g., vanilla languages such as Python. The "issue with automatic differentiation" is really an issue with dividing by zero. Graph x^(1/3): its slope is vertical at zero.
My mistake, then. I did not intend to be so precise; I was thinking loosely. I have a very simple hand-rolled autodiff tool (in Python) here to play with. I raise 0 to a negative exponent in the gradient of the cube root at 0. I do not get a NaN. Strictly speaking, I get this:
"ZeroDivisionError: 0.0 cannot be raised to a negative power"
Is that satisfying? I do not claim anything more specific here. Sorry to be imprecise.
Numerical methods can be efficient, but nothing beats an analytic method that can simply be evaluated once derived. Often you can go from an approximate answer requiring numerous function evaluations to an exact answer in one simpler function evaluation. Pathological cases notwithstanding.
Just because automatic differentiation is a different technique than numerical differentiation doesn't mean it isn't a numerical method.
(I probably should not say this, but I will anyway: your comments regarding math and science always give me the distinct impression that you know the name of a lot of things without knowing much about the thing. You are frequently breathless about analog quantum computing, Lie algebras, Chu spaces, some universal concept of duality, but then you fail to respond with substantial arguments to people that try to temper the breathlessness. If you want to convince people of things, you should work on this.)
> substantial arguments to people that try to temper the breathlessness
Which arguments in particular?
> Your comments regarding math and science always give me the distinct impression that you know the name of a lot of things without knowing much about the thing.
OK, be more specific: which ones do you care about? Maybe I don't always have time to answer everyone.
> some universal concept of duality
It's the concept of adjointness. And yes, it's very universal.
Huh, interesting. First time I read this comment, it said something like "why should I care what you think?" I guess the edit indicates that you do care?
It is still important to know, even if most division problems you encounter are handed off to a calculator or computer. Why? Because understanding long division is required background for understanding polynomial division. And understanding that is required to understand how you can get all the roots of a polynomial using Newton's method. And that is required to figure out how many different plasma modes there are. And THAT is a problem no computer can simply answer for you. Of course, only a small number of people care, and we don't teach about plasma modes in school, but we teach the basics that you need to start your way toward more complicated things. Long division of polynomials is also used in the construction of CRC checksums.
So no, understanding long division is not meaningless. It gives you a starting point for a lot of other things.
Learning to do calculus allows you to do it on paper and in your head. Lines of code are liabilities that need to be maintained, understood concepts are assets that let you build, maintain, and simplify.
Well technically, over differentiable functions, but yes. Derivatives can be evaluated by decomposing functions, taking elementary derivatives, and recomposing the functions, in a way that integrals can't be.
Automatic differentiation only works on functions which are locally complex analytic. It fails on things that fall outside that model, but are still differentiable. Daubechies wavelets are a good example of this.
Julia's systems can make use of ChainRules.jl, which doesn't assume functions are locally complex analytic; instead it uses Wirtinger derivatives (df/dz and df/dzbar), and when a function is complex analytic it just uses the well-known simplification df/dzbar = 0. This allows non-analytic functions to be used; you just need to supply all primitives in Wirtinger form. Julia has enough non-ML people using this stuff that complex numbers actually get used here :).
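(In case the Wirtinger formulation is unfamiliar, here is a tiny numerical illustration in Python, since that is what the rest of the thread uses. It only demonstrates the textbook definition, nothing about ChainRules.jl's internals, and the `wirtinger` helper is an invented name.)

# Numerical check of the Wirtinger definition (not ChainRules.jl):
# df/dz = (df/dx - i*df/dy)/2 and df/dzbar = (df/dx + i*df/dy)/2.
def wirtinger(f, z, h=1e-6):
    dfdx = (f(z + h) - f(z - h)) / (2 * h)            # partial along the real axis
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # partial along the imaginary axis
    return (dfdx - 1j * dfdy) / 2, (dfdx + 1j * dfdy) / 2

print(wirtinger(lambda z: z ** 2, 1 + 2j))         # analytic: df/dzbar ~ 0, df/dz ~ 2z
print(wirtinger(lambda z: z.conjugate(), 1 + 2j))  # non-analytic: df/dz ~ 0, df/dzbar ~ 1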
Kids in school say, "Why do I have to learn X? I am never going to need it in real life."
And then you end up with adults who don't believe that the earth is round or that vaccines are low-risk and extremely beneficial, who think that the existence of snow means there is no climate change, and that homeopathy and essential oils are a valid alternative to modern medicine, etc.