
This paper builds off of DeepMind's previous work on differentiable computation: Neural Turing Machines. That paper generated a lot of enthusiasm when it came out in 2014, but not many researchers use NTMs today.

The feeling among researchers I've spoken to is not that NTMs aren't useful. DeepMind is simply operating on another level. Other researchers don't understand the intuitions behind the architecture well enough to make progress with it. But it seems like DeepMind, and specifically Alex Graves (first author on NTMs and now this), can.



The reason other researchers haven't jumped on NTMs may be that, unlike commonly-researched types of neural nets such as CNNs or RNNs, NTMs are not currently the best way to solve any real-world problem. The problems they have solved so far are relatively trivial, and they are very inefficient, inaccurate, and complex relative to traditional CS methods (e.g. Dijkstra's algorithm coded in C).

That's not to say that NTMs are bad or uninteresting! They are super cool and I think have huge potential in natural language understanding, reasoning, and planning. However, I do think that DeepMind will have to prove that they can be used to solve some non-trivial task, one that can't be solved much more efficiently with traditional CS methods, before people will join in on their research.

Also, I think there's a possibility that solving non-trivial problems with NTMs may require more computing power than Moore's law has given us so far. In the same way that NNs didn't really take off until GPU implementations became available, we may have to wait for the next big hardware breakthrough for NTMs to come into their own.


The brain is not a single universal neural network that does everything well. It's a collection of different neural networks that specialize in different tasks, and probably use very different methods to achieve them.

It seems like the way forward would be networking together various kinds of neural networks to achieve complex goals. For example, an NTM specialized in formulating plans that has access to a CNN for image recognition, and so on.


This is being done using various types of networks. See these slides on image captioning by Karpathy for an example using a CNN and RNN: http://cs.stanford.edu/people/karpathy/sfmltalk.pdf
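If it helps, the wiring in those slides roughly comes down to something like this (a pure numpy sketch; every shape and name here is invented for illustration): the CNN's image features condition an RNN, which then emits caption words one step at a time.

    import numpy as np

    # Pretend CNN output: a 512-d feature vector for one image.
    img_feat = np.random.randn(512)

    # Condition a tiny RNN on the image by projecting the CNN features into
    # its initial hidden state (one common way to couple the two nets).
    hidden = np.tanh(np.random.randn(256, 512) @ img_feat)
    W_h = np.random.randn(256, 256) * 0.01
    W_x = np.random.randn(256, 300) * 0.01   # 300-d word embeddings (arbitrary)
    W_out = np.random.randn(1000, 256)       # 1k-word vocab (arbitrary)

    word_emb = np.random.randn(300)          # embedding of the start token
    for _ in range(5):                       # emit a few caption words
        hidden = np.tanh(W_h @ hidden + W_x @ word_emb)
        next_word = int(np.argmax(W_out @ hidden))
        word_emb = np.random.randn(300)      # would be the embedding of next_word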


If we're going with a brain metaphor, what would be those neural networks' version of synesthesia?


Feeding mp3s to an image recognition neural net. And as soon as I typed that, I want to try it.


Actually, in the architecture you described, if there is a planning net that's connected to image net and an audio net, rather than feeding audio to the image net I think synesthesia would be better modeled by feeding the output of the audio net into the image net's input on the planning net. If that makes sense.


Not the output. Making several individual connections between intermediate layers of the different nets.


CNNs can actually be used for audio tasks too, on spectrograms.
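For anyone wondering what that means concretely, here's a tiny numpy sketch (window and hop sizes picked arbitrarily): the waveform becomes a 2-D time-frequency array that an ordinary image CNN can consume.

    import numpy as np

    # Fake 1-second mono signal at 16 kHz (stand-in for decoded audio samples).
    sr = 16000
    audio = np.random.randn(sr)

    # Short-time Fourier transform by hand: slide a window, take |FFT| per frame.
    win, hop = 512, 256
    frames = [audio[i:i + win] * np.hanning(win)
              for i in range(0, len(audio) - win, hop)]
    spectrogram = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

    # Shape is (freq_bins, time_frames) -- effectively a grayscale "image".
    print(spectrogram.shape)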


It's how some guys defeated the first iteration of reCAPTCHA's audio mode. Then Google replaced it with something very annoying to use even for humans.


They sure put a lot of focus on "toy" problems such as sorting and path planning in their papers - perhaps because they are easy to understand and show a major improvement over other ML approaches. IMHO they should focus more on "real" problems - e.g. in Table 1 of this paper it seems to be state of the art on the bAbI tasks, which is amazing.


At least some of the "toy" problems aren't chosen just for being easy to solve or understand. They're chosen for being qualitatively different than the kinds of problems other neural nets are capable of solving. Sorting, for example, is not something you can accomplish in practice with an LSTM.

Mainstream work on neural nets is focused on pattern recognition and generation of various forms. I don't mean to trivialize at all when I say this - this gives us a new way to solve problems with computers. It allows us to go beyond the paradigm of hand-built algorithms over bytes in memory.

What DeepMind is exploring with this line of research is whether neural nets can even subsume this older paradigm. Can they learn to induce the kinds of algorithms we're used to writing in our text editors? Given this goal, I think it's better to call problems like sorting "elementary" rather than "toy".


bAbI isn't really a "real" problem either, although somewhat better than sorting and the like. bAbI works with extremely restrictive worlds and grammar. In contrast, current speech recognition, language modeling, and object detection do quite well with actual audio, text, and pictures.

I think the strength of NTMs will be best demonstrated by putting them to work on a long-range language modeling task where you need to organize what you read so that you can use it to predict better a paragraph or two later. Current language models based on LSTMs are not really able to do this.


Any chance you could link a pdf of the paper for us?


Once you have a learning machine that can solve simple problems, you can scale it up to solve very complex problems. It's a first step to true AI imho. A lot of small steps are needed to get to this goal. Integrating memory & neural nets is a big step imho.


> Once you have a learning machine that can solve simple problems, you can scale it up to solve very complex problems.

Nope. It's really easy to solve simple problems; it can sometimes even be done by brute-force.

That's what caused the initial optimism around AI, e.g. the 1950s notion that it would be an interesting summer project for a grad student.

Insights into computational complexity during the 1960s showed that scaling is actually the difficult part. After all, if brute-force were scalable then there'd be no reason to write any other software (even if a more efficient program were required, the brute-forcer could write it for us).

That's why the rapid progress on simple problems, e.g. using Eliza, SHRDLU, General Problem Solver, etc. hasn't been sustained, and why we can't just run those systems on a modern cluster and expect them to tackle realistic problems.


DeepMind is breaking new ground in a number of directions. For example, "Decoupled Neural Interfaces using Synthetic Gradients" is simply amazing - they can make training a net async and run individual layers on separate machines by approximating the gradients with a local net. It's the kind of thing that sounds crazy on paper, but they proved it works.
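Very rough sketch of the idea as I understand it (all names and sizes are made up, and the "true" gradient is faked with noise just to show the update rule): a small local model predicts the gradient a layer will eventually receive, so the layer can update without waiting for the full backward pass.

    import numpy as np

    # One layer h = W x, plus a tiny linear model M that predicts dL/dh from h
    # alone -- a crude stand-in for the paper's local gradient-predicting nets.
    W = np.random.randn(4, 3)
    M = np.zeros((4, 4))
    x = np.random.randn(3)

    # Forward pass, then update W immediately using the *predicted* gradient.
    h = W @ x
    predicted_grad = M @ h
    W -= 0.01 * np.outer(predicted_grad, x)

    # When the true dL/dh eventually arrives (faked here with noise), the
    # predictor M is itself trained to match it.
    true_grad = np.random.randn(4)
    M -= 0.01 * np.outer(M @ h - true_grad, h)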

Another amazing thing they did was to generate audio by direct synthesis from a neural net, beating all previous benchmarks. If they can make it work in real time, it would be a huge upgrade in our TTS technology.

We're still waiting for the new and improved AlphaGo. I hope they don't bury that project.


I'm not super knowledgeable about the space, but would the audio generation you mentioned be what is needed to let their Assistant communicate verbally in any language, any voice, add inflections, emotion, etc. without needing to pre-record all the chunks/combinations?


Decoupled Neural Interfaces using Synthetic Gradients is a fancy name for the electro-chemical gradient that lies outside the cell membrane of neurons: https://en.wikipedia.org/wiki/Electrochemical_gradient

It's decoupled yet stores transient local information regarding previous neuron activity.

Another bio-inspired copy-pasta.


You should absolutely get a job doing it, if you think bio-inspired copy-pasta is all it takes. May I recommend Numenta?


Please choose derogatory phrases like 'copy pasta' intentionally and carefully.

Many algorithms are bio-inspired -- good artists borrow, the best steal.


>> DeepMind is simply operating on another level.

Would you be so kind as to explain what you mean here?

Thanks !


They're taking features that are present in the brain that aren't modeled and are making computational models for them. They're not a gold standard. You can create your own in under an hour. It's not another level. It's bio-inspired computing.

Here.. take the 'Axon Hillock' https://en.wikipedia.org/wiki/Axon_hillock code up a function for it, attach it to present day neuron models, make it do something fancy, write a white-paper and kazaam you're operating on another level..

Get it?


ok I get it :) Nice little sarcasm, I'm loving it :-)


Alan Turing's tape machine + neuron model.

In the human brain, neurons store an incredible amount of information. Neuron models in neural networks only do so with weights.

There is still a lack of understanding of how the human brain does it. DeepMind grabbed a proven memory model from Alan Turing's work and applied it to the feature-barren neuron models in use. Sprinkle magic ...

They are not operating on another level; they're bringing over features that are well documented in the human brain and in white papers from a past period when people actually thought deeply about this problem, and applying them.

https://en.wikipedia.org/wiki/Bio-inspired_computing

There is no 'intuition' about the architecture. Study the human brain and copy pasta into the computing realm.

Others are doing this as well. If anyone bothered to read the white papers people publish, you'll see that many people have presented similar ideas over the years.

You can come up with your own neural Turing machine. Take a featureless neuron model, slap a memory module on it and you have a neural Turing machine.


In order to use a Turing machine in a neural network - or at least to train it, in any way that isn't impractical and/or cheating - you need to make it differentiable somehow.

Graves and co. have been really creative in overcoming problems in their ongoing program to differentiate ALL the things.


In this context what does differentiable mean?


I think the easiest way to see this is by an example of a non-differentiable architecture.

Let's suppose on the current training input, the network produces some output that is a little wrong. It produced this output by reading a value v at location x of memory.

In other words, output = v = mem[x]

It could be wrong because the value in memory should have been something else. In this case, you can propagate the gradient backwards. Whatever the error was at the output, is also the error at this memory location.

Or it could be wrong because it read from the wrong memory location. Now you're a bit dead in the water. You have some memory address x, and you want to take the derivative of v with respect to x. But x is this sort of thing that jumps discretely (just as an integer memory address does). You can't wiggle x to see what effect it has on v, which means that you don't know which direction x should move in in order to reduce the error.

So (at least in the 2014 paper, ignoring the content-addressed memory), memory accesses don't look like v = mem[x]. They look like v = sum_i(a_i * mem[i]). Any time you read from memory, you're actually reading all the memory, and taking a weighted sum of the memory values. And now you can take derivatives with respect to that weighting.
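A minimal numpy sketch of that soft read, with toy sizes and made-up names:

    import numpy as np

    # Toy memory: N slots of width W.
    N, W = 8, 4
    mem = np.random.randn(N, W)

    # Instead of a hard address x, the controller emits scores that a softmax
    # turns into a weighting a over all N slots.
    scores = np.random.randn(N)
    a = np.exp(scores) / np.exp(scores).sum()

    # Soft read: v = sum_i(a_i * mem[i]), a weighted sum over every slot.
    v = a @ mem

    # v is now a smooth function of the scores, so the error at v can be
    # backpropagated into the addressing itself.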

To me, the question this raises is: what right do we have to call this a Turing machine? It's a very strong departure from Turing machines and digital computers.


Turing didn't specify how reads and writes happened on the tape. For the argument he was making it was clearer to assume there was no noise in the system.

As for "digital" computers remember they are built out of noisy physical systems. Any bit in the CPU is actually a range of voltages that we squash into the abstract concept of binary.


I don't think that is really relevant to the discussion. Regardless of how a digital computer is physically implemented, we use it according to specification. We concretize the concept of binary by designing the machine to withstand noise. What we get when we choose the digital abstraction is that it is actually realistic: digital computers pretty much operate digitally. Corruption happens, but we consider that an error, and we try to design so that a programmer writing all but the most critical of applications can assume that memory does not get corrupted.

We don't squash the range of voltages. The digital component that interprets that voltage does the squashing. And we design it that way purposefully. https://en.wikipedia.org/wiki/Static_discipline

Turing specified that the reads and the writes are done by heads, which touch a single tape position. You can have multiple (finitely many) tapes and heads, without leaving the class of "Turing machine". But nothing like blending symbols from adjacent locations on the tape, or requiring non-local access to the tape.


No wonder Google built (is building) custom accelerators in hardware. This points to a completely different architecture from von Neumann's, or at least it points to MLPUs, Machine Learning Processing Units.


Pardon my ignorance as I'm not super knowledgeable on this, but is what you described around reading all the memory and taking the weighted sum of values similar in a sense to creating a checksum to compare something against?


I suppose I can see the similarity, in that there's some accumulated value (the sum) from reading some segment of memory, but otherwise I don't think the comparison is helpful.


It means it can be trained by backpropagating the error gradient through the network.

To train a neural network, you want to know how much each component contributed to an error. We do that by propagating the error through each component in reverse, using the partial derivatives of the corresponding function.
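As a toy illustration (my own variable names, nothing from the paper), here is that reverse pass written out by hand for a one-layer sigmoid net:

    import numpy as np

    # Forward pass: y = sigmoid(W x), squared-error loss against target t.
    x = np.random.randn(3)
    W = np.random.randn(2, 3)
    t = np.array([0.0, 1.0])

    z = W @ x
    y = 1.0 / (1.0 + np.exp(-z))
    loss = 0.5 * np.sum((y - t) ** 2)

    # Reverse pass: chain the partial derivatives back through each component.
    dy = y - t               # dL/dy
    dz = dy * y * (1.0 - y)  # dL/dz, through the sigmoid
    dW = np.outer(dz, x)     # dL/dW, the gradient used to update the weights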


Don't forget you first need to understand the mathematical theory of how a brain does computation and pattern recognition. Of course they look into how a brain does it. But the mathematical underpinnings, and how the information flows, are much more important than how an individual neuron works in real life. Abstraction, and applying it to real data, is what they are doing.


Any chance you could fix this statement:

Input = Data

Process = Optimisation to create an automaton.

Output = Automaton

Computing power means much larger variable spaces can be handled in optimisation problems. NNs are a means to prune the variable space during optimisation in a domain-unspecific way.



