A decade ago, symbolic differentiation could also be used to compute normal vectors for mathematically defined surfaces (like isosurfaces). Conal Elliott (the guy who co-invented FRP) wrote a great paper on this with fairly well-defined semantics:

http://conal.net/Vertigo/

and

http://conal.net/papers/beautiful-differentiation/
Automatic differentiation is a little different from symbolic differentiation. Rather than requiring a new high-level description of the computational graph, tools like autograd [1] can differentiate somewhat arbitrary Python and numpy instructions.
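For example (a minimal sketch; the function f below is made up, but autograd really does trace whatever operations a call actually executes, control flow included):

```python
from autograd import grad

def f(x):
    # ordinary Python control flow, no special graph-building API
    for _ in range(3):
        if x > 1.0:
            x = x / 2.0
        else:
            x = x * x
    return x

df = grad(f)
print(df(0.5))  # 0.0625: on this path f(x) = x**8, so f'(0.5) = 8 * 0.5**7
```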
> can differentiate somewhat arbitrary Python and numpy instructions.
I hear this a lot but it doesn't make sense to me. How do you differentiate a conditional or a loop? It doesn't make sense mathematically, so I'm not sure what sense it makes programmatically.
For a loop, you can unroll it and define the derivative that way (see the sketch below). But you're right that, in general, there are programs with no sensible derivative for which automatic differentiation will still compute something.
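Concretely (a sketch using autograd; both functions are illustrative, and the loop and its unrolling get the same derivative):

```python
from autograd import grad

def f_loop(x):
    for _ in range(3):   # f(x) = 2*(2*(2*x)) = 8x, written as a loop
        x = 2.0 * x
    return x

def f_unrolled(x):
    return 2.0 * (2.0 * (2.0 * x))

print(grad(f_loop)(1.5), grad(f_unrolled)(1.5))  # both print 8.0
```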
Going beyond AD there is the "differential lambda calculus" [1]. I don't really understand this work; it requires more background in programming languages, logic, and analysis than I have time to acquire. My understanding is that it lets you compute more derivatives (e.g. for higher-order functions) and impose conditions so that only sensible derivatives are computed. There is also "differential linear logic" [2], which has some relation to the differential lambda calculus, but it's unclear to me what that relationship is.
So in summary I believe:
1. automatic differentiation will compute the correct derivative when one exists, and some nonsense answer otherwise (see the sketch after this list)
2. to increase the space of programs for which derivatives can be computed and to rule out other programs you need to look at differential lambda calculus / linear logic.
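As a tiny illustration of point 1 (a sketch with autograd; the f below is mathematically just the identity, so the true derivative at 0 is 1, but AD differentiates the branch it happens to execute):

```python
from autograd import grad

def f(x):
    # mathematically the identity: f(x) = x for every x
    return x * 0.0 if x == 0.0 else x

print(grad(f)(0.0))  # 0.0 -- nonsense, the true derivative is 1
print(grad(f)(2.0))  # 1.0 -- correct away from the special case
```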
Do you have a simple example of a loop with no sensible derivative? Or rather, of a loop computing something which should have one, but for which AD fails?
If your loop is computing something like f(g(...h(x))), i.e. applying functions in sequence starting from x, then there's no problem -- dual numbers will compute the correct derivative.
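For instance, here is a bare-bones dual number class (just a sketch; only + and * are implemented, to keep it short) pushed through such a loop:

```python
class Dual:
    """val + eps, where the eps coefficient carries the derivative."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)
    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    def __mul__(self, other):
        other = self._lift(other)
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

def apply_in_sequence(x, fs):
    for f in fs:          # applies fs[0], then fs[1], ... to x
        x = f(x)
    return x

x = Dual(2.0, 1.0)        # seed dx/dx = 1
y = apply_in_sequence(x, [lambda t: t + 3.0, lambda t: t * t])
print(y.val, y.eps)       # 25.0 10.0, since d/dx (x+3)^2 = 2(x+3) = 10 at x=2
```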
If the loop is an implementation detail (I think the Poisson distribution p(λ,k) is often computed with a loop up to k), then making λ a dual number still works. Clearly you can't differentiate with respect to k, because it's an integer.
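E.g., reusing the Dual class sketched above (the e^{-λ} factor is left out, since the toy class only has + and *):

```python
def poisson_weight(lam, k):
    # lam^k / k! via a loop up to k; the loop bound k is an integer
    result = Dual(1.0)
    for i in range(1, k + 1):
        result = result * lam * (1.0 / i)
    return result

p = poisson_weight(Dual(2.0, 1.0), 3)
print(p.val, p.eps)   # lam^3/3! = 4/3, and d/dlam = lam^2/2 = 2.0 at lam=2
```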
I don't have a simple example (and I'm not claiming that loops, in general, lack sensible derivatives).
This talk discusses some of the issues from 27 minutes onwards, if you're interested in diving in a bit deeper:

https://www.youtube.com/watch?v=qhPBfysSYI8

I found it pretty heavy going. YMMV.
You regard a conditional as a piecewise smooth function, like f(x) = {0 if x < 0, x^2 if x >= 0}. Then the derivative is that of the smooth piece you actually evaluated.
Or to put it another way: since the derivative by definition changes x only infinitesimally, it never changes which side of the conditional x lands on.
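Concretely (a sketch with autograd; the constant piece is written as x*0.0 so the output still formally depends on x):

```python
from autograd import grad

def f(x):
    return x * 0.0 if x < 0.0 else x * x   # pieces: 0 and x^2

df = grad(f)
print(df(3.0))    # 6.0, the derivative of the x^2 piece
print(df(-2.0))   # 0.0, the derivative of the constant piece
```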
For functions that are not differentiable you can use the subdifferential [0] (or subgradient), which gives a set rather than a single value. In your example this would be the interval [-2,1].
In practice, from what I've read, it seems fine to take any value in this set; most commonly the one-sided derivatives are just averaged. It helps that these points are rarely encountered.
For more detail, check out Chapter 6 [1], Section 6.3 "Hidden Units" in The Deep Learning book.
Clearly y'(0) = 0 here, since at x=0 the function is defined to be a constant.
In calculus class this would be ill-defined / ambiguous, but the computer is simple-minded and picks one answer.
This doesn't seem like a major flaw to me. What are you using this for that would care? Maybe you have something like a rectification which produces exactly x=0 much more often than 1 in 2^32... in which case a function like your y amounts to a special case for this.
If you cared, you could write a special case for the derivative too. Here's one which will give y'(0) = 42 but otherwise the automatic derivative:
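Something like this sketch, say, using autograd's extension API (the concrete y here is just an illustrative stand-in for the one discussed above):

```python
from autograd import grad
from autograd.extend import primitive, defvjp

@primitive
def y(x):
    return 0.0 if x == 0.0 else x   # the x == 0 special case

# Hand-written gradient: 42 exactly at zero, 1 elsewhere (matching y(x) = x).
defvjp(y, lambda ans, x: lambda g: g * (42.0 if x == 0.0 else 1.0))

dy = grad(y)
print(dy(0.0))   # 42.0
print(dy(3.0))   # 1.0
```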
> Clearly y'(0) = 0 here, since at x=0 the function is defined to be a constant.
Every function is constant at any single point, though. For example, I could define the identity function as `y(x) = if (x > 0) { x } elsif (x < 0) { x } else { 0 }`, but that doesn't mean that its derivative at the origin is `0`.
No, dataflow's function y is defined in three smooth pieces, one of which is a constant. That's what this differentiation is seeing: it evaluates at x+ϵ, and at 0+ϵ it takes the ==0 branch, which always returns 0 and hence has gradient 0.
Your function also has three pieces, and the only difference from `y(x) = x` is that you've chosen to specify a derivative at zero, i.e. y(x+ϵ) = if (x==0) { 0+0ϵ } else { x+1ϵ }. Which you may have reason to do, say to tell your optimiser to stop pushing on the wall.
Yeah, this is a good case for using multiple dispatch to flexibly tie into systems built on generic code. The problem with autodiff is that it doesn't necessarily take all branches, so it can't know how to deal with these kinds of issues; having the power to fix it yourself is crucial.
If one defines a subset DSL that can be auto differentiated via a symbolic method, then is it really that different? Embedded DSLs are pretty common in the Haskell world anyways.
Symbolic differentiation doesn't work for loops and conditionals, and the derivative expression can also grow exponentially in the size of the source expression.
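On the second point, a quick way to see the swell (just an illustration, using sympy; the nested expression is made up):

```python
import sympy

x = sympy.symbols('x')
expr = x
for _ in range(5):
    expr = sympy.sin(expr) * sympy.cos(expr)   # nest a few compositions

d = sympy.diff(expr, x)
# The derivative expression comes out dramatically larger than the original:
print(sympy.count_ops(expr), sympy.count_ops(d))
```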
The article is a bit broken for me: while the page is https, the mathjax file is loaded over plain http. Mixed content like that is blocked by current Chrome and Firefox, so the figures and LaTeX don't render.
edit: ah-hah, HTTPS Everywhere must have fiddled with it.
I don't understand in what sense programs can be called “differentiable”. Is the space of programs modulo observational equivalence a manifold to begin with? (I don't think it's Hausdorff or even T1, but I could be wrong.)
The examples given in the article are merely derivatives of ordinary mathematical functions defined by ordinary mathematical expressions: in particular, there is no sequencing, and there are no conditionals or loops. So why call them “differentiable programs” when you are actually dealing with ordinary differentiable functions from good old 19th-century analysis? We need urgent improvements in the intellectual honesty department.