Differentiation for Hackers (github.com/mikeinnes)
176 points by adamnemecek on June 22, 2019 | 51 comments


This is really a nice thing to put out there. I want to add that AD (automatic differentiation) can be a great jumping-off point to other things. (It sure was for me in my studies, anyway.) Here is the path I followed (though I am sure there are many paths; I am too dumb and tired to dream up another right now. Perhaps that is why I am so eager to put this all out here in a "gushy" way -- my apologies):

First I implemented simple forward mode AD in Python (with NumPy), handling gradients and Hessians so you could apply AD to find extrema of constrained variational problems (to design constrained B-spline geometry, for example, or to construct equations of motion). Then I extended it to handle interval analysis, which introduces a tiny bit of topology (fixed point theorems and such -- provably find all solutions to nonlinear systems over some domain).
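
In case it helps anyone get started, the core of that first step fits in a few lines. A rough sketch (the Dual class and the sin wrapper are mine, not from any library; NumPy only enters once you want vectors, gradients and Hessians):

  import math

  class Dual:
      # carries a value together with its derivative w.r.t. the input
      def __init__(self, value, deriv=0.0):
          self.value, self.deriv = value, deriv

      def __add__(self, other):
          return Dual(self.value + other.value, self.deriv + other.deriv)

      def __mul__(self, other):    # product rule
          return Dual(self.value * other.value,
                      self.value * other.deriv + self.deriv * other.value)

  def sin(d):                      # chain rule for one elementary function
      return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

  x = Dual(2.0, 1.0)               # seed: dx/dx = 1
  y = sin(x * x) * x               # f(x) = x * sin(x^2), evaluated at x = 2
  print(y.value, y.deriv)          # derivative = sin(4) + 8*cos(4)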

Then pick up declarative programming (unification) and revisit overloading to build up a mathematical expression tree for anything you have implemented. (In detail, overload the math such that, for an object which is a mathematical combination of two other objects, you store in the new object's memory some representation of the two original objects, plus the operation used to combine them. Remember to track your automatically generated connectives! All expressions are ternary -- n-ary stuff gets broken out into ternary stuff.)

Then, starting at the final "node" of this mathematical tree, walk back down the tree to build whatever you want. (Reverse mode AD can be done this way, and is a good thing to try -- see the sketch below.) I used it to automatically compile math down into logical rules for constraint programming. Much fun!
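
To make the last two paragraphs concrete, here is a stripped-down sketch (class and method names are mine, and only binary + and * are shown): each overloaded op records its two operands and the local derivatives, and backward() walks from the final node back toward the leaves.

  class Node:
      def __init__(self, value, parents=()):
          self.value = value       # numeric value at this node
          self.parents = parents   # list of (parent_node, local_derivative)
          self.grad = 0.0

      def __add__(self, other):    # d(a+b)/da = 1, d(a+b)/db = 1
          return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

      def __mul__(self, other):    # d(a*b)/da = b, d(a*b)/db = a
          return Node(self.value * other.value,
                      [(self, other.value), (other, self.value)])

      def backward(self):
          # order the tree so each node comes after its inputs, then walk it
          # in reverse, pushing gradients from the final node to the leaves
          topo, seen = [], set()
          def visit(n):
              if n not in seen:
                  seen.add(n)
                  for p, _ in n.parents:
                      visit(p)
                  topo.append(n)
          visit(self)
          self.grad = 1.0
          for n in reversed(topo):
              for p, local in n.parents:
                  p.grad += local * n.grad

  x = Node(3.0)
  y = x * x + x            # overloading records the tree as this line runs
  y.backward()             # walk back down from the final node
  print(y.value, x.grad)   # 12.0 and dy/dx = 2*3 + 1 = 7.0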

Now re-do it all in C++ but use overloading to build a math lib to support everything, or use Eigen or some other expression template lib. (I'm not done with this.)

Ah, expression template libraries are another instance of "reverse mode style" expression tree computation, a.k.a. the "interpreter pattern". Good fun here, but more involved I think.


Here's a very readable survey paper:

http://jmlr.org/papers/volume18/17-468/17-468.pdf


What do automatic differentiation tools do for functions with integer arguments?


Great question. Both TensorFlow and PyTorch pretty much do nothing https://github.com/tensorflow/tensorflow/issues/20524
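
For example, here is roughly what PyTorch does if you even try (a sketch; the comments paraphrase the error rather than quote it):

  import torch

  x = torch.tensor([1, 2, 3])      # an integer (int64) tensor
  try:
      x.requires_grad_(True)       # autograd refuses integer dtypes:
  except RuntimeError as err:      # gradients are only supported for
      print(err)                   # floating point (and complex) tensors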


That seems strange, if I'm reading it correctly.

In the author's packages, integers which label dimensions etc. have gradient `nothing`, but arrays which happen to contain integers do not signal anything:

    julia> Zygote.gradient((x,d)->abs(sum(sin,fft(x,d))), [2 2; 0 0], 2)
    (Complex{Float64}[-0.346356+0.0im 1.65364+0.0im; -2.0+0.0im 0.0+0.0im], nothing)


It is possible to do backprop with integers: https://arxiv.org/abs/1802.04680


Autodiff is insane. Kids waste how much time in school doing analytical calculus like troglodytes.


I don't really understand this comment. The rules of forward mode automatic differentiation are exactly the rules of symbolic differentiation one would learn in high school, except AD is evaluated at a particular point (thus avoiding the need to build an expression tree and thus avoiding exponential blow-up in the size of said tree).

AD is also on fairly shaky ground once you step outside what can be expressed in standard analytical calculus. For example it is fairly easy to express the same function in different ways and get derivatives that differ. A lot of work on semantics is needed.


> For example it is fairly easy to express the same function in different ways and get derivatives that differ.

Example?


I posted an example. y=cuberoot(x^3) and y=x should give different derivatives at x=0 in, I'm guessing, all autodiff approaches you'll get your hands on.


How does y=x break anything? It works fine.


Did you try taking the derivative of y = cuberoot(x^3) at 0? If it works fine, tell me your secret!


Thank you, this is interesting. I just tried. You need L'Hospital's Rule.

I'd realized this was an issue with forward-mode autodiff, but hadn't thought about reverse-mode to realize it comes up there too.

In general, NaN is a placeholder for "all the derivatives you didn't know you needed in advance". I think the Dual Numbers shed light on this (see "Division": [https://en.wikipedia.org/wiki/Dual_number#Division]).

One could imagine lazily computing higher derivatives as needed for L'Hospital's Rule in either Forward or Reverse mode, however. For Forward Mode, I guess you'd replace your eager "Dual Number" pairs with lazy lists?

For reverse mode it seems you could do extra "passes" on the tape as needed... People who do this stuff for a living must know(?), but a quick Google doesn't turn up an answer for me...


Actually I'm wrong about L'Hospital's Rule in this problem; it solves nothing here. I would insert an "[EDIT ...]" into my original post to avoid misleading people, but it's too late to edit.

It doesn't matter how many times you differentiate x^(1/3); you can not evaluate the result at zero. Which is funny, because you'd think you could differentiate x^3 and use the inverse function theorem. But there's that caveat about it not working when the derivative is zero.

I am now revisiting high school calculus and doubting my own sanity. Will post updates if/when I get my head right.


Here's the (pedantically-worked) example with reverse-mode:

Program:

  y = g(x) = x^3
  z = f(y) = cuberoot(y)
Want: dz/dx

Reverse-mode autodiff:

  // Forward pass
  y = g(0) = 0^3         = 0      g'(0) = 3 * 0^2                          = 0
  z = f(0) = cuberoot(0) = 0      f'(0) = (1/3) / cuberoot(0^2) = (1/3)/0  = Inf

  // Reverse pass
  dz/dy = f'(0)                   = Inf
  dz/dx = dz/dy * g'(0) = Inf * 0 = NaN


Sorry, I was out.

Ok,

f(x) = cuberoot(x^3)

f(0 + eps) = cuberoot((0 + eps)^3) = cuberoot(eps^3) = eps

i.e. the value is 0 and the slope is 1.


    >>> import autograd.numpy as np
    >>> from autograd import grad
    >>> def fn(x):
    ...   return np.power(np.power(x, 3), 1/3)
    ...
    >>> gradfn = grad(fn)
    >>> gradfn(0.0)
    /usr/local/lib/python3.6/dist-packages/autograd/numpy/numpy_vjps.py:59: RuntimeWarning: divide by zero encountered in double_scalars
    lambda ans, x, y : unbroadcast_f(x, lambda g: g * y * x ** anp.where(y, y - 1, 1.)),
    /usr/local/lib/python3.6/dist-packages/autograd/numpy/numpy_vjps.py:59: RuntimeWarning: invalid value encountered in double_scalars
     lambda ans, x, y : unbroadcast_f(x, lambda g: g * y * x ** anp.where(y, y - 1, 1.)),
    nan
… Damn.

    >>> import torch
    >>> x = torch.tensor(0.0, requires_grad=True)
    >>> y = ((x**3) ** (1/3))
    >>> y.backward()
    >>> x.grad
    tensor(nan)
… Damn.


eps^3 is undefined


What is eps^3? ;) You have to turn that into some representation on the computer before you can even pass it to cuberoot.


There is no representation of that value. In this case power is defined only in the range 1-2. You need to simplify the roots first.


Except that, by the definition of dual numbers, eps^2 is 0, as is eps^3. So the slope you computed is 0, which is rather wrong for (a complicated representation of) y=x.


Eps^3 is undefined. I don't see a problem with simplifying the exponents first.

> (a complicated representation of) y=x.

The problem started as that. So no surprise.


> I don't see a problem with simplifying the exponents first.

Sure, we call that computer algebra. As soon as you start doing that, you aren't doing automatic differentiation, you are doing (at least in part) symbolic differentiation.


You could similarly say that a computer can't calculate x^1000/x^998 because you can't fit it in 32/64 bits.


And computers have a hard time with that. Take the following example code:

  #include <math.h>
  #include <stdio.h>
  #include <stdlib.h>
  double f(const double x) {
      return pow(x, 1000) / pow(x, 998);
  }
  int main(int argc, char* argv[]) {
      if(argc != 2) {
          fprintf(stderr, "Usage: %s x\n", argv[0]);
          exit(1);
      }
      const double x = atof(argv[1]); // demo code without error checking
      printf("x = %g, f(x) = %g\n", x, f(x));
      exit(0);
  }
compile with

  gcc -Wall -g -o test test.c -lm
and run it

  petschge@localhost:~$ ./test 0
    x = 0, f(x) = -nan
  petschge@localhost:~$ ./test 1
    x = 1, f(x) = 1
  petschge@localhost:~$ ./test 2
    x = 2, f(x) = 4
The fun thing is, when you compile with

  gcc -Wall -O3 -ffast-math -g -o test2 test.c -lm
you actually get

  petschge@localhost:~$ ./test2 0
    x = 0, f(x) = 0


That's exactly what I'm saying.


It's good to question what we teach. Like I never learned any methods for computing a square root, to arbitrary precision, whereas my parents did.

But autodiff doesn't replace symbolic differentiation. Consider the derivative of cuberoot(x^3). Autodiff struggles at x=0 because x^3 is flat (slope 0) and cuberoot is vertical (slope inf). And together, with autodiff, you get slope NaN.

If you wanted to try to fix that problem within autodiff -- and first off, it's likely not worth it, because similar problems are inherent to autodiff -- you'd still need an analytic understanding of what's going on.


It's true that most autodiff systems would behave this way, but a human given only the rules of calculus (the same rules as AD) would have trouble at the same point. Of course, most mathematicians would apply additional insights while differentiating here (such as cancellation of terms or L'Hôpital's rule). But equally, you could add these things to an AD system; they are orthogonal to the basic derivative transformation, so it's really a difference of degree at best.


I'm out right now but this doesn't sound right. How does autodiff struggle with it?


You will have to put a bit of work into thinking about it, if you don't already see it. At the risk of being more rude than effective, I spelled it out pretty clearly above if you understand autodiff, which requires an understanding of the chain rule, which most people first learn about in a calculus class.

If you want to claim that autodiff is a replacement for all our differentiation needs, then you should try to understand when it doesn't work.


The differentiation of a fractional power (with the fraction less than 1) involves a negative fractional power, and 0 raised to a negative power involves dividing by zero. This gives you some kind of NaN in, e.g., vanilla languages such as Python. The "issue with automatic differentiation" is really an issue with dividing by zero. Graph x^(1/3): its slope is vertical at zero.


I don't think this is right. An autodiff package will likely report +inf for the derivative of cuberoot at 0.

The problem is really that there are a lot of ways the function can have infinite slope. The sqrt function also has infinite slope at 0.

Dividing by zero isn't where the NaN comes from.
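
You can check this with the same PyTorch setup used earlier in the thread (a sketch; the printed tensors are what I'd expect, from memory):

    >>> import torch
    >>> x = torch.tensor(0.0, requires_grad=True)
    >>> y = x ** (1/3)         # cuberoot on its own
    >>> y.backward()
    >>> x.grad                 # infinite slope at 0, but not NaN
    tensor(inf)
    >>> x.grad = None          # reset before the second experiment
    >>> z = (x ** 3) ** (1/3)  # the composed version
    >>> z.backward()
    >>> x.grad                 # chain rule multiplies inf by 0
    tensor(nan)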


My mistake then. I did not intend to try and be so precise. I was thinking loosely. I have a very simple autodiff tool hand rolled (in Python) here to play with. I raise 0 to a negative exponent in the gradient of the cube root at 0. I do not get a NAN. Strictly speaking I get this:

"ZeroDivisionError: 0.0 cannot be raised to a negative power"

Is that satisfying? I do not claim anything more specific here. Sorry to be imprecise.


You... do realise why doing these things analytically is worthwhile, right?


I don't, no.


Numerical methods can be efficient, but nothing beats an analytic method that can simply be evaluated once derived. Oftentimes you can go from an approximate answer requiring numerous function evaluations to an exact answer in one simpler function evaluation. Pathological cases notwithstanding.


AD is NOT a numerical method, it's very precise. You can in fact do it on paper as well. https://blog.demofox.org/2014/12/30/dual-numbers-automatic-d...

Analytical derivatives grow exponentially longer with the length of the expression. AD doesn't have this problem.
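
A concrete way to see the swell (a sketch; it assumes SymPy is installed and borrows the iterated logistic map example from the survey paper linked elsewhere in the thread):

  import sympy

  # Symbolic route: build l(l(l(l(x)))) for l(x) = 4x(1-x), differentiate,
  # and expand -- the derivative expression swells with each iteration.
  x = sympy.symbols('x')
  expr = x
  for _ in range(4):
      expr = 4 * expr * (1 - expr)
  d_symbolic = sympy.expand(sympy.diff(expr, x))
  print("ops in expanded symbolic derivative:", sympy.count_ops(d_symbolic))

  # Forward-mode AD route: carry (value, derivative) through the same loop.
  # Constant work per iteration; no expression is ever materialised.
  v, dv = 0.3, 1.0
  for _ in range(4):
      v, dv = 4*v*(1 - v), 4*(1 - 2*v)*dv
  print("value and derivative at x = 0.3:", v, dv)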


Just because automatic differentiation is a different technique than numerical differentiation doesn't mean it isn't a numerical method.

(I probably should not say this, but I will anyway: your comments regarding math and science always give me the distinct impression that you know the name of a lot of things without knowing much about the thing. You are frequently breathless about analog quantum computing, Lie algebras, Chu spaces, some universal concept of duality, but then you fail to respond with substantial arguments to people that try to temper the breathlessness. If you want to convince people of things, you should work on this.)


> Just because automatic differentiation is a different technique than numerical differentiation doesn't mean it isn't a numerical method.

It does mean that. Pls see the overview paper http://jmlr.org/papers/volume18/17-468/17-468.pdf in particular section "2.1 AD Is Not Numerical Differentiation"

> substantial arguments to people that try to temper the breathlessness

Which arguments in particular?

> Your comments regarding math and science always give me the distinct impression that you know the name of a lot of things without knowing much about the thing.

Ok be more specific, which ones do you care about? Maybe I don't always have time to answer everyone.

> some universal concept of duality

It's the concept of adjointness. And yes, it's very universal.


Huh, interesting. First time I read this comment, it said something like "why should I care what you think?" I guess the edit indicates that you do care?


No, I edited it because HN has reached out to me many times before that I need to be less of an asshole or I'll get banned.

So no, I don't care about what you think, but I don't want to get banned.

Anyway, nice that you read the first version but still didn't respond to the other questions.


I don't understand why this is being downvoted. It's not like doing division in long form is that meaningful.


It is still important to know, even if most division problems you encounter are handed off to a calculator or computer. Why? Because understanding long division is required background for understanding polynomial division. And understanding that is required to understand how you can get all roots of a polynomial using Newton's method. And that is required to figure out how many different plasma modes there are. And THAT is a problem no computer can simply answer for you. Of course only a small number of people care, and we don't teach about plasma modes in school, but we teach the basics that you need to start your way towards more complicated things. Long division of polynomials is also used in the construction of CRC checksums.

So no, understanding long division is not meaningless. It gives you a starting point for a lot of other things.


Learning to do calculus allows you to do it on paper and in your head. Lines of code are liabilities that need to be maintained; understood concepts are assets that let you build, maintain, and simplify.


AD can be done on paper too. Analytical calculus derivatives get exponentially longer with the length of the expression being differentiated.

https://blog.demofox.org/2014/12/30/dual-numbers-automatic-d...


Autodiff only works because analytical calculus can prove that differentiation is a functor over functions.


over continuous functions?


Well technically, over differentiable functions, but yes. Derivatives can be evaluated by decomposing functions, taking elementary derivatives, and recomposing the functions, in a way that integrals can't be.


Automatic differentiation only works on functions which are locally complex analytic. It fails on things that fall outside that model, but are still differentiable. Daubechies wavelets are a good example of this.


Julia's systems can make use of ChainRules.jl, which doesn't assume functions are locally complex analytic; instead it uses Wirtinger derivatives (df/dz and df/dzbar), and if a function is complex analytic it just uses the well-known simplification df/dzbar = 0. This allows non-analytic functions to be used; you just need to supply all primitives in Wirtinger form. Julia has enough non-ML people using this stuff that complex numbers actually get used here :).


Kids in school say "why do I have to learn X, I am never going to need it in real life".

And then you end up with adults who don't believe that the earth is round or that vaccines are low-risk and extremely beneficial, and who think that the existence of snow means there is no climate change, that homeopathy and essential oils are a valid alternative to modern medicine, etc.



