moshe's comments | Hacker News

Please note that the GitHub page says "not an official Google product", rather than "project". An official Google product would be something like gmail.


I guess I was confused because the main TensorFlow page doesn't have that phrase. Made it sound like it won't be officially supported by Google.


TensorFlow Fold provides a TensorFlow implementation of the dynamic batching algorithm (described in detail in our paper [1]). Dynamic batching is an execution strategy for computation graphs, you could also implement it in PyTorch or Chainer or any other framework.

Our particular implementation of dynamic batching uses the TF while loop, which means that you don't need to make run-time modifications to the actual TF computation graph. At runtime, we essentially encode the computation graph for (let's say) a parse tree as a serialized protocol buffer (tf.string), so instead of varying the computation graph itself, we vary the input to a static computation graph. This particular implementation strategy is very much a byproduct of how TensorFlow works (static computation graph, heavy lifting happens in ops implemented in C++).

[1] Deep Learning with Dynamic Computation Graphs, https://openreview.net/pdf?id=ryrGawqex
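
If it helps to see the scheduling idea outside of TF, here's a rough plain-NumPy toy (a hypothetical sketch for this thread, not the Fold API or the while-loop machinery, and with made-up embed/combine ops): walk a batch of trees level by level and batch every node at the same depth into one vectorized call.

  import numpy as np

  rng = np.random.default_rng(0)
  EMBED = rng.normal(size=(100, 8))            # toy leaf-embedding table

  def embed(ids):                              # batched "leaf" op
      return EMBED[np.asarray(ids)]

  def combine(left, right):                    # batched "internal node" op
      return np.tanh(left + right)

  def depth(t):
      return 0 if isinstance(t, int) else 1 + max(depth(c) for c in t)

  def subtrees(t):
      yield t
      if not isinstance(t, int):
          for c in t:
              yield from subtrees(c)

  def evaluate(trees):
      """Evaluate binary trees (leaves are token ids) bottom-up, batching
      every node at the same depth into a single vectorized call."""
      memo = {}
      nodes = {s for t in trees for s in subtrees(t)}
      for d in range(max(depth(t) for t in trees) + 1):
          level = [n for n in nodes if depth(n) == d]
          if d == 0:                                       # all leaves at once
              vecs = embed(level)
          else:                                            # all merges at this depth at once
              left = np.stack([memo[n[0]] for n in level])
              right = np.stack([memo[n[1]] for n in level])
              vecs = combine(left, right)
          memo.update(zip(level, vecs))
      return np.stack([memo[t] for t in trees])

  # two differently-shaped parse trees, evaluated with 4 batched calls in total
  print(evaluate([(1, (2, 3)), ((4, 5), (6, (7, 8)))]).shape)   # (2, 8)

The real implementation keeps a single static graph and feeds the per-example structure in as data (the serialized protocol buffers), with the TF while loop doing the per-depth iteration.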


Congratulations on the nice work! It is very elegant to use combinators to formulate and solve this problem. Though my worry with combinators is that they introduce awkwardness when setting up 'long range' DAG edges: you can end up shuttling things around manually through tuples and whatnot. I'm not sure how it is with your framework.

Am I right in thinking that there is no bucketing going on here? In other words, each batch is fixed length, and a short-lived DAG is planned and then simulated with tf.while to accommodate the set of shapes in that particular batch? Are there any problems when the input shapes are wildly different in computational expense? For example, imagine a size-agnostic convnet. Maybe some of the images in the training set are small and others are large; how would that look in your framework, if it can be done? Is junk padding part of the picture, to help match almost-equal tensors to allow them to be batched?


Thanks, insightful questions.

You're absolutely right about combinators and long-range dependencies. We have a special block type in the high-level API (https://github.com/tensorflow/fold/blob/master/tensorflow_fo...) for accumulating results without having to explicitly shuttle them around; it's not perfect, but very handy in many cases.

Regarding your second question, the equivalent of padding in dynamic batching is "pass through" do-nothing ops, which are introduced transparently but worth being aware of if you want to understand the machinery. The worst case scenario here is chains of wildly varying lengths, where we need to add pass-throughs to the shorter chains to match the length of the longest chain.
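
To make the pass-through idea concrete, here is a toy sketch (hypothetical NumPy code, not the actual machinery): chains of different lengths all run for the same number of steps, and the finished ones get identity "do nothing" steps so every step stays batchable.

  import numpy as np

  def step(h):                         # the real per-step op (toy)
      return np.tanh(h + 1.0)

  def run_chains(lengths, width=4):
      h = np.zeros((len(lengths), width))
      for t in range(max(lengths)):
          active = np.array(lengths) > t              # which chains still have real work
          h = np.where(active[:, None], step(h), h)   # identity pass-through for the rest
      return h

  print(run_chains([2, 5]))   # both chains take 5 "steps"; the short one idles for 3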


From my understanding you cannot do similar dynamic batching with PyTorch, since it is eagerly executed. Not sure about Chainer. Can you explain a bit more about how this can be applied to other frameworks?


I've never used PyTorch or Chainer, so maybe I'm wildly wrong, but here goes..

Dynamic batching requires you to predefine a set of "batch-wise" operations (node types for computation graphs). This works fine for eager execution as well, because the eager code is inside a function definition. Of course, if you want to do dynamic batching you need to introduce a layer of abstraction, e.g. instead of saying a+b you create a dynamic batching operation for addition and add edges to an execution graph; this is no different from our TF implementation.
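
As a hypothetical sketch of that abstraction layer (not tied to any framework, and record/leaf/add are made-up names): user code calls record("add", x, y) instead of x + y, so even an eagerly-executed framework only ever sees a graph of nodes that a scheduler can later batch by op type and depth.

  GRAPH = []                                  # list of (op, input node ids)

  def record(op, *inputs):
      GRAPH.append((op, inputs))
      return len(GRAPH) - 1                   # the node id acts as a deferred value

  def leaf(value): return record("leaf", value)
  def add(x, y):   return record("add", x, y)

  # build the graph for (a + b) + b without computing anything yet
  a, b = leaf(1.0), leaf(2.0)
  out = add(add(a, b), b)
  print(GRAPH)   # a scheduler can now batch all "add" nodes of equal depth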


Thanks. Yeah, my understanding is that you cannot have eager execution all the way down; some eager execution inside the function is fine, but you shouldn't have it outside the function blocks.

Without this function-block abstraction, though, it doesn't seem like you can get very far with dynamic batching (or implement it at all).


I have gotten plenty of great applicants - thanks guys! Please stop sending me your application info at this point...


It is, actually.


"Currently pursuing a BS, MS or PhD in computer science or a related technical field." is verbatim from the intern job listing (http://www.google.com/support/jobs/bin/answer.py?answer=1262...), and I suppose it is there because the target audience for Google Engineering internships is students who are aiming to work some place like Google Engineering after graduating - not exactly the same category as "technically qualified students". So if you'd consider working at some place Google Engineering post graduation, then you should definitely apply.

For the record, I would consider economics to be a related technical field (cf. John von Neumann, cf. "Evolution of Cooperative Problem-Solving in an Artificial Economy" - http://www.whatisthought.com/hayek32000.pdf)...


The clearest description of what CS researchers talk about when they talk about "intelligence" (see http://www.vetta.org/definitions-of-intelligence/) is probably a production function from economics.


But it is! I really am lucky enough to get to do machine learning research at Google and code in Common Lisp!

See http://research.google.com/. Another way you can tell that this is an official Google project is the 'Google' label on the right-hand side of http://code.google.com/p/plop/, which is only added to code developed at Google that has been open-sourced.

Cheers!


Maybe one day if you're bored, you could tell the rest of us how you were able to swing a job at Google Research working with Lisp?

I don't know about these guys, but I've always been under the impression that if you're at Google, you're using one of their "Big 4" languages.... Heck, even Norvig is using Python! (I know, he was using it before he went there... but still).


I can't speak for the whole process, but a requirement is a PhD in an area Google cares about.

The research divisions of large companies are generally different than the production side. Researchers generally have the freedom to choose how to implement something.


I figure Google is too smart to restrict its hiring to knowing a certain set of languages, especially for its research division. Languages can be quickly learned, sheer badassery cannot.


a) Not everyone in Google Research has a PhD, though of course there is a bit of a correlation (also, plenty of "research" takes place in general engineering).

b) You do generally have to be very familiar with one of the Big 4 to get hired (C++ in my case).

c) Plop is kind of a weird edge case insofar as it is very long-term research, and really cries out to be implemented in a language where you have the same features at compile-time and run-time, and can conveniently represent and manipulate code as data (i.e. a Lisp). For doing plop in any other language, your first task would be to implement these features, which is rather a lot of work (believe me, I've tried ;->). For most anything else one of the Big 4 would be used, for the reasons Steve Yegge outlines...


Also, before anyone gets too excited, Google generally doesn't go beyond Java, C++ and Python. But we do have some special purpose languages internally, as well.

Also, I don't have a PhD :-)


If they're hiring lisp programmers ... I AM willing to relocate! Seriously.


I think in one of Steve Yegge's epic narratives of life at Google, he said that only the "Big 4" languages can be deployed on Google's servers.

Does this mean, if you ever wanted your research to go "live" in a Google product, you would first have to rewrite in C++, Java, Python, or Javascript?


> Another way you can tell that this is an official Google project is the 'Google' label

Neat, that's nice to know.


Eventually, learn arbitrary programs from data, e.g.

  Input:  (learn 'fib '(x) '(((1) 1) ((2) 1) ((3) 2) ((4) 3)))
  Output: (defun fib (n) (if (< n 3) 1 (+ (fib (1- n)) (fib (- n 2)))))

as well as standard machine-learning tasks such as supervised classification.

For what it does right now, see the examples at the bottom of the quick start guide:

http://code.google.com/p/plop/wiki/QuickStart

For more technical background see e.g. http://metacog.org/main.pdf (my dissertation). I will also add a list of relevant publications to the wiki...


Pretty interesting.. actually looks somewhat related to my thesis:

http://www.cs.utexas.edu/~jderick/thesis.pdf

I'll have to look at it more closely when I get a chance.

I was working on learning proofs, rather than programs, but I think there are a lot of similarities.


I was working on learning proofs, rather than programs, but I think there are a lot of similarities.

I'm just a mathematician, and my mind has obviously been warped by continua and the Axiom of Choice, but aren't they the same thing?

http://en.wikipedia.org/wiki/Curry-Howard_correspondence

Actually, I've gotten interested in learning more about theoretical CS lately, but everyone in my department is too old/applied/Russian to care, and my university's CS people only seem to care about operating systems and e-commerce. Can you recommend a logical starting place?


The notation you'd use to represent the program/proof is very important.
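
For one concrete instance of the correspondence (Lean syntax, purely as an illustration): the single term below reads both as function composition and as a proof that implication is transitive.

  -- one term, two readings: a program of type (A → B) → (B → C) → (A → C),
  -- and a proof that implication composes
  example {A B C : Prop} : (A → B) → (B → C) → (A → C) :=
    fun hab hbc ha => hbc (hab ha)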


One interesting difference is that if you are learning a proof, then you know when you are successful. At least for my research, I generated a bunch of small cases of the theorem I was trying to prove, then tried to learn the general case. If I was successful, I would have a proof of the original conjecture. When learning programs from some data, you can always test your program against the data, but you can never be sure that you are generalizing the data properly.


The equivalence seems to hold only when the program is complete. Thus, while all proofs can be expressed as programs, not all programs can be expressed as proofs.


Can't you restrict the domain so that the program is complete (for that restricted domain)?


I meant correct, not complete.

I refer to the article on wikipedia which states: "A converse direction is to use a program to extract a proof, given its correctness. This is only feasible if the programming language the program is written for is very richly typed: the development of such type systems has been partly motivated by the wish to make the Curry-Howard correspondence practically relevant."


Programs are equivalent to proofs given their correctness, rather. Additionally, according to the wikipedia article, extracting a proof from a program requires a very richly typed programming language.


Thanks for the pointer. I was actually just looking at ACL the other day - I have a friend here at Google who is investigating integrating it with plop as a 20% project. I'm very excited that your work combines theorem-proving with learning, kudos! Do you do anything probabilistic to decide which generalizations to try?


It is more just a set of techniques for generalizing theorems. At this point in time I would say there are a few roadblocks to doing some kind of data-driven probabilistic search for inductive proofs (like ACL2 does). The first is that there just aren't that many generalization techniques already implemented to choose from. The second is that most provers don't even have the basic backtracking mechanisms built in to support that kind of behavior. Although this may seem surprising, the search space for a typical theorem is so huge (technically infinite, but practically speaking just too complex to bother with) that usually if you don't find a proof right away with a high-confidence technique, then you are probably just going to go off into the weeds.

Now that more processing power than ever is available, I'm sure this will gradually change. But a lot of theorem proving technology is very old (dates to the 70s) so it may take some time.

The main technique I came up with looks at a particular finite case of a conjecture and generates a proof for that case using simple rewriting. You then look for patterns in that proof and use them to create a generalization. As a simple example, imagine you're trying to prove a theorem about lists. First, I consider the case where the list has only 4 elements. Since this is a finite case, I can prove this easily with rewriting. Now I look at that proof and notice that rule X was applied 4 times. If I do the same thing when the list has 5 elements, that rule is applied 5 times. So in the general case, apply the rule n times.
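
Here is a hypothetical toy version of that loop in Python (nothing from the thesis, and the rule names are made up): prove length(append(xs, ys)) for a concrete n-element xs by exhaustive rewriting, log which rule fires at each step, and read the counts off as a function of n.

  from collections import Counter

  def rewrite_once(t):
      """Try each rewrite rule at the root; return (rule name, new term) or None."""
      if isinstance(t, tuple):
          if t[0] == "append" and isinstance(t[1], tuple) and t[1][0] == "cons":
              return "append-cons", ("cons", t[1][1], ("append", t[1][2], t[2]))
          if t[0] == "append" and t[1] == "nil":
              return "append-nil", t[2]
          if t[0] == "length" and isinstance(t[1], tuple) and t[1][0] == "cons":
              return "length-cons", ("+", 1, ("length", t[1][2]))
          if t[0] == "length" and t[1] == "nil":
              return "length-nil", 0
      return None

  def normalize(t, trace):
      """Innermost rewriting: normalize the children, then keep rewriting the root."""
      if isinstance(t, tuple):
          t = (t[0],) + tuple(normalize(c, trace) for c in t[1:])
      step = rewrite_once(t)
      if step is None:
          return t
      trace[step[0]] += 1
      return normalize(step[1], trace)

  def prove_instance(n):
      """Rewrite length(append([x1..xn], ys)) for a concrete n-element list, symbolic ys."""
      xs = "nil"
      for i in reversed(range(n)):
          xs = ("cons", i, xs)
      trace = Counter()
      normalize(("length", ("append", xs, "ys")), trace)
      return dict(trace)

  for n in (3, 4, 5):
      print(n, prove_instance(n))
  # append-cons and length-cons each fire exactly n times: the hint that the
  # general proof applies those rules n times, i.e. induction on the length of xs.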

One novel thing that I came across in this work is something I called "hybrid proof terms". I found that it is very difficult to find patterns in sequences of terms (the typical way of representing a rewrite-based proof). So I used a representation that factors out the things that do not change from one line of the proof to the next.

Feel free to email me if you would like to discuss any more details. I'm not really working on this project anymore but I still find it a fascinating area of study.

Also, for your friend who is thinking about working with ACL2, he should know that the acl2-help mailing list is quite friendly and helpful.


The problem of working out what a program does given its input and output is a very general one. Maybe there could be a competition for programs that do this? (Like the Loebner Prize, but more sensible since the Loebner Prize doesn't really have anything to do with progress in AI). Or maybe such a competition already exists and I don't know about it.

Perhaps a generalisation of this could be used as an intelligence test for AI programs. (for criteria for such a test, see http://www.overcomingbias.com/2008/10/economic-defini.html )


Well, the difficulty with this is that program induction is in fact too general to be very interesting - to create a competition you would need to publish some test problems, or describe a special distribution of "natural programs" that you especially cared about...


> program induction is in fact too general to be very interesting

Intelligence is general and is capable of coping with novel problems. AI therefore involves building machines which have that level of generality. To say this isn't interesting is to say AI isn't interesting.

If you published exactly the problems that were going to be in the test, programs would be written that would solve those problems but would be very poor at doing anything else.

Anyway here are some input-output pairs that I think would be suitable for a learning program:

(i) append 'x' to the end of the string:

r => rx 123 => 123x

Similarly other insertions, deletions, copies of characters or groups of characters.

(ii) learning Roman or Arabic numerals or other number-coding schemes:

/ => i // => ii /// => iii ///////// => ix

3 => xxx 7 => xxxxxxx 9 => xxxxxxxxx 11 => xxxxxxxxxxx

Ideally, once the program has learnt the above two functions it should have an understanding of the underlying concepts and therefore find it easier to learn functions such as:

iii => 3 xii => 12 xvi => 16


> Intelligence is general and is capable of coping with > novel problems. AI therefore invovles building machines > which have that level of generality. > There is a bias when you think of "programs" to think of programs that a human would actually write or consider useful. But in fact, this is only a tiny (and nontrivial to specify) fraction of the space of "possible programs" - see for example _Foundations of Genetic Programming_, by Langdon and Poli, for some next explorations of general program space. Most possible programs are completely useless and uninteresting to humans (same with proofs, grammars, etc. etc.).

To do AI you have to realize that in the sentences: "Humans have general intelligence and can solve problems in many domains" and "Breadth-first search is a problem-general technique that will always find a solution if one exists" the word "general" does not mean the same thing at all! (cf. the huge literature on inductive bias in human cognition).

For a nice list of program induction problems solvable by search (with a system called ADATE), see http://www-ia.hiof.no/~rolando/Examples/index.html .


You're right that most possible programs are useless and uninteresting. So I agree with you that if a learning program was given problems to solve that are generated by generating random programs, that wouldn't be interesting.

I am not suggesting doing that. What I am suggesting is that the problems-to-solve be generated by humans.

Incidentally, what I am suggesting isn't quite inductive programming but something slightly more general, in that the job of the problem-solver isn't to generate a program that solves the input/output pairs, but to guess what the next output will be depending on its input. (The problem-solver may or may not do this by internally creating a program that solves the problem.)

Anyway I've done a quick write-up of what I propose: http://www.includipedia.com/wiki/User:Cabalamat/Function_pre...


Oops the formatting got fucked up. What I meant was:

  r => rx
  123 => 123x
and:

  / => i 
  // => ii 
  /// => iii 
  ///////// => ix

  3 => xxx 
  7 => xxxxxxx 
  9 => xxxxxxxxx 
  11 => xxxxxxxxxxx

  iii => 3 
  xii => 12 
  xvi => 16


So, it's genetic programming... right?


Well, sort of. Genetic programming with probabilistic modeling and sampling instead of crossover/mutation (i.e. it is an estimation-of-distribution algorithm), where programs are represented in reduced (normal) form and the choice of which variables to use in the probabilistic models is determined adaptively and changes as the search progresses... ;->
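
For anyone unfamiliar with the term, here is a minimal PBIL-style sketch of the "model and sample instead of crossover/mutation" idea, on bitstrings rather than programs (hypothetical toy code; plop's models and the reduced-normal-form program representation described above are far richer):

  import numpy as np

  def fitness(bits):                       # toy objective: count the 1s
      return bits.sum(axis=1)

  def eda(n_bits=20, pop=50, iters=60, lr=0.1, seed=0):
      rng = np.random.default_rng(seed)
      p = np.full(n_bits, 0.5)             # probabilistic model: P(bit_i = 1)
      for _ in range(iters):
          sample = (rng.random((pop, n_bits)) < p).astype(int)    # sample a population
          best = sample[np.argsort(fitness(sample))[-pop // 5:]]  # select the top 20%
          p = (1 - lr) * p + lr * best.mean(axis=0)               # re-fit the model
      return p

  print(eda().round(2))   # the model's probabilities drift toward 1.0 on this toy problem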


Genetic programming, with mutations guided by learned patterns from a corpus of data? (or that's what I imagine it to be)


Actually, I work full-time for Google Research (and have written all of the plop code, so far). Google happens to have sponsored a bunch of summer-of-code students through opencog.org, one of whom I supervised, but he coded in C++.

I think that the idea of a Lisp system that learns Lisp programs via probabilistic modeling is intrinsically interesting regardless of who funds it, but that could be personal bias ;->.


I meant no offense. The work does appear to be interesting, but not for the reason indicated in this post's title.


None taken. Thanks!


moshe: by the way, you have the greatest email address ever.


madscience at google com? wow, i second that sentiment


Heh, thanks. moshe at google dot com was already taken, it was my second choice...

