A decade ago, symbolic differentiation could also be used to compute normal vectors for mathematically defined surfaces (like isosurfaces). Conal Elliott (the guy who co-invented FRP) wrote a great paper on this with fairly well-defined semantics:

http://conal.net/Vertigo/

and

http://conal.net/papers/beautiful-differentiation/
Automatic differentiation is a little different from symbolic differentiation. Rather than requiring a new high-level description of the computational graph, tools like autograd [1] can differentiate somewhat arbitrary Python and numpy instructions.
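For example (a minimal sketch; the function f below is made up, but autograd really does trace whatever operations a call actually executes, control flow included):

```python
from autograd import grad

def f(x):
    # ordinary Python control flow, no special graph-building API
    for _ in range(3):
        if x > 1.0:
            x = x / 2.0
        else:
            x = x * x
    return x

df = grad(f)
print(df(0.5))  # 0.0625: on this path f(x) = x**8, so f'(0.5) = 8 * 0.5**7
```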
> can differentiate somewhat arbitrary Python and numpy instructions.
I hear this a lot but it doesn't make sense to me. How do you differentiate a conditional or a loop? It doesn't make sense mathematically, so I'm not sure what sense it makes programmatically.
For a loop, you can unroll it and define the derivative that way (see the sketch below). But you're right that, in general, there are programs with no sensible derivative for which automatic differentiation will still compute something.
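Concretely (a sketch using autograd; both functions are illustrative, and the loop and its unrolling get the same derivative):

```python
from autograd import grad

def f_loop(x):
    for _ in range(3):   # f(x) = 2*(2*(2*x)) = 8x, written as a loop
        x = 2.0 * x
    return x

def f_unrolled(x):
    return 2.0 * (2.0 * (2.0 * x))

print(grad(f_loop)(1.5), grad(f_unrolled)(1.5))  # both print 8.0
```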
Going beyond AD there is the "differential lambda calculus" [1]. I don't really understand this work; it requires more background in programming languages, logic, and analysis than I have time to acquire. My understanding is that it lets you compute more derivatives (e.g. for higher-order functions) and impose conditions so that only sensible derivatives are computed. There is also "differential linear logic" [2], which has some relation to the differential lambda calculus, but it's unclear to me what that relationship is.
So in summary I believe:
1. automatic differentiation will compute the correct derivative when one exists, and some nonsense answer otherwise (see the sketch after this list)
2. to increase the space of programs for which derivatives can be computed and to rule out other programs you need to look at differential lambda calculus / linear logic.
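As a tiny illustration of point 1 (a sketch with autograd; the f below is mathematically just the identity, so the true derivative at 0 is 1, but AD differentiates the branch it happens to execute):

```python
from autograd import grad

def f(x):
    # mathematically the identity: f(x) = x for every x
    return x * 0.0 if x == 0.0 else x

print(grad(f)(0.0))  # 0.0 -- nonsense, the true derivative is 1
print(grad(f)(2.0))  # 1.0 -- correct away from the special case
```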
Do you have a simple example of a loop with no sensible derivative? Or rather, of a loop computing something which should have one, but for which AD fails?
If your loop is computing something like f(g(...h(x))), i.e. applying functions in sequence starting from x, then there's no problem -- dual numbers will compute the correct derivative.
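For instance, here is a bare-bones dual number class (just a sketch; only + and * are implemented, to keep it short) pushed through such a loop:

```python
class Dual:
    """val + eps, where the eps coefficient carries the derivative."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)
    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.val + other.val, self.eps + other.eps)
    def __mul__(self, other):
        other = self._lift(other)
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

def apply_in_sequence(x, fs):
    for f in fs:          # applies fs[0], then fs[1], ... to x
        x = f(x)
    return x

x = Dual(2.0, 1.0)        # seed dx/dx = 1
y = apply_in_sequence(x, [lambda t: t + 3.0, lambda t: t * t])
print(y.val, y.eps)       # 25.0 10.0, since d/dx (x+3)^2 = 2(x+3) = 10 at x=2
```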
If the loop is an implementation detail (I think the Poisson distribution p(λ,k) is often computed with a loop up to k), then making λ a dual number still works. Clearly you can't differentiate with respect to k, because it's an integer.
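E.g., reusing the Dual class sketched above (the e^{-λ} factor is left out, since the toy class only has + and *):

```python
def poisson_weight(lam, k):
    # lam^k / k! via a loop up to k; the loop bound k is an integer
    result = Dual(1.0)
    for i in range(1, k + 1):
        result = result * lam * (1.0 / i)
    return result

p = poisson_weight(Dual(2.0, 1.0), 3)
print(p.val, p.eps)   # lam^3/3! = 4/3, and d/dlam = lam^2/2 = 2.0 at lam=2
```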
I don't have a simple example (and I'm not claiming that loops, in general, lack sensible derivatives).
This talk discusses some of the issues from 27 minutes onwards, if you're interested in diving in a bit deeper:

https://www.youtube.com/watch?v=qhPBfysSYI8

I found it pretty heavy going. YMMV.
You regard a conditional as a piecewise smooth function, like f(x) = {0 if x < 0, x^2 if x >= 0}. Then the derivative is that of the smooth piece you actually evaluated.
Or to put it another way: since the derivative by definition changes x only infinitesimally, it never changes which side of the conditional x lands on.
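Concretely (a sketch with autograd; the constant piece is written as x*0.0 so the output still formally depends on x):

```python
from autograd import grad

def f(x):
    return x * 0.0 if x < 0.0 else x * x   # pieces: 0 and x^2

df = grad(f)
print(df(3.0))    # 6.0, the derivative of the x^2 piece
print(df(-2.0))   # 0.0, the derivative of the constant piece
```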
For functions that are not differentiable you can use the subdifferential [0] (or subgradient), which gives a set rather than a single value. In your example this would be the interval [-2,1].
In practice, from what I've read, it seems fine to take any value in this set; most commonly the one-sided derivatives are just averaged. It helps that these points are rarely encountered.
For more detail, check out Chapter 6 [1], Section 6.3 "Hidden Units" in The Deep Learning book.
Clearly y'(0) = 0 here, since at x=0 the function is defined to be a constant.
In calculus class this would be ill-defined / ambiguous, but the computer is simple-minded and picks one answer.
This doesn't seem like a major flaw to me. What are you using this for that would care? Maybe you have something like a rectification which produces exactly x=0 much more often than 1 in 2^32... in which case a function like your y amounts to a special case for this.
If you cared, you could write a special case for the derivative too. Here's one which will give y'(0) = 42 but otherwise the automatic derivative:
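Something like this sketch, say, using autograd's extension API (the concrete y here is just an illustrative stand-in for the one discussed above):

```python
from autograd import grad
from autograd.extend import primitive, defvjp

@primitive
def y(x):
    return 0.0 if x == 0.0 else x   # the x == 0 special case

# Hand-written gradient: 42 exactly at zero, 1 elsewhere (matching y(x) = x).
defvjp(y, lambda ans, x: lambda g: g * (42.0 if x == 0.0 else 1.0))

dy = grad(y)
print(dy(0.0))   # 42.0
print(dy(3.0))   # 1.0
```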
> Clearly y'(0) = 0 here, since at x=0 the function is defined to be a constant.
Every function is constant at any single point, though. For example, I could define the identity function as `y(x) = if (x > 0) { x } elsif (x < 0) { x } else { 0 }`, but that doesn't mean that its derivative at the origin is `0`.
No, dataflow's function y is defined in three smooth pieces, one of which is a constant. That's what this differentiation is seeing: it evaluates at x+ϵ, and at 0+ϵ it takes the ==0 branch, which always returns 0 and hence has gradient 0.
Your function also has three pieces, and the only difference from `y(x) = x` is that you've chosen to specify a derivative at zero, i.e. y(x+ϵ) = if (x==0) { 0+0ϵ } else { x+1ϵ }. Which you may have reason to do, say to tell your optimiser to stop pushing on the wall.
Yeah, this is a good case for using multiple dispatch to flexibly tie into systems built on generic code. The problem with autodiff is that it doesn't necessarily take all branches, so it can't know how to deal with these kinds of issues; having the power to fix it yourself is crucial.
If one defines a subset DSL that can be auto differentiated via a symbolic method, then is it really that different? Embedded DSLs are pretty common in the Haskell world anyways.
Symbolic differentiation doesn't work for loops and conditionals, and the derivative expression can also grow exponentially in the size of the source expression.
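On the second point, a quick way to see the swell (just an illustration, using sympy; the nested expression is made up):

```python
import sympy

x = sympy.symbols('x')
expr = x
for _ in range(5):
    expr = sympy.sin(expr) * sympy.cos(expr)   # nest a few compositions

d = sympy.diff(expr, x)
# The derivative expression comes out dramatically larger than the original:
print(sympy.count_ops(expr), sympy.count_ops(d))
```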
The article is a bit broken for me: while the page is https, the mathjax file is loaded over plain http. Mixed content like that is blocked by current Chrome and Firefox, so the figures and LaTeX don't render.
edit: ah-hah, HTTPS Everywhere must have fiddled with it.
I don't understand in what sense programs can be called “differentiable”. Is the space of programs modulo observational equivalence a manifold to begin with? (I don't think it's Hausdorff or even T1, but I could be wrong.)
The examples given in the article are merely derivatives of ordinary mathematical functions defined by ordinary mathematical expressions: in particular, there is no sequencing, and there are no conditionals or loops. So why call them “differentiable programs” when you are actually dealing with ordinary differentiable functions from good old 19th-century analysis? We need urgent improvements in the intellectual honesty department.