I've often wondered about the theoretical limits of a neural net's ability to learn from examples - it seems like a fascinating subject with lots of implications. From a quick search, I found this paper:
https://experts.illinois.edu/en/publications/computational-l...
which is already very interesting. Are there any other good pointers on this?
According to the universal approximation theorem [1], a multilayer feedforward net with a single hidden layer can approximate any continuous function on a compact subset of R^n to arbitrary accuracy. There's also a nice visual proof. [2]
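To see the theorem in action (not a proof, just an illustration), here's a minimal sketch assuming numpy and scikit-learn are installed; the hidden-layer width of 200 and the target sin(3x) are my own arbitrary choices:

    # Illustration of universal approximation: fit a single-hidden-layer net
    # to a continuous function on a compact interval.
    # Assumes numpy + scikit-learn; width and target are arbitrary choices.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(-np.pi, np.pi, size=(2000, 1))   # compact domain [-pi, pi]
    y = np.sin(3 * x).ravel()                        # target continuous function

    net = MLPRegressor(hidden_layer_sizes=(200,),    # one hidden layer, width 200
                       activation='tanh', max_iter=5000, tol=1e-6)
    net.fit(x, y)

    x_test = np.linspace(-np.pi, np.pi, 500).reshape(-1, 1)
    err = np.max(np.abs(net.predict(x_test) - np.sin(3 * x_test).ravel()))
    print(f"max abs error on the domain: {err:.3f}")

Widening the hidden layer generally drives the error down, which is the existence the theorem promises; whether backpropagation actually finds that approximation is a separate question.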
Are there also theorems about which functions can be learned, or how quickly they can be learned, by a multilayer feedforward neural net trained with backpropagation?
The short story is that VC dimension describes the "worst case" generalization behavior of a hypothesis class. That makes it a good predictor of performance on genuinely hard problems, but many modern problems turn out to be theoretically easy. For example, neural networks trained on images generalize much better than the VC dimension of neural networks would suggest. This is because the distribution of natural images is closer to a best-case scenario than a worst-case scenario.
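To make "worst case" concrete: a standard uniform-convergence bound based on VC dimension (stated up to constants and log factors, so treat the exact form as a sketch) says that with probability at least 1 - delta over an i.i.d. sample of size m, every hypothesis h in a class of VC dimension d satisfies

    L_D(h) \le L_S(h) + O\left( \sqrt{ \frac{d + \log(1/\delta)}{m} } \right)

where L_D(h) is the true error and L_S(h) the error on the sample. The right-hand side depends only on the hypothesis class, never on the data distribution, which is exactly why it is so pessimistic for neural nets on natural images.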
Lots of research is being done to determine when we are in a best-case scenario. The "simplest" such measure is the Rademacher complexity (see Chapter 26 of the UML book). "Simplest" is in quotes because most PhD-level machine learning students I've met don't understand what Rademacher complexity is unless they specialize in theory. And Rademacher complexity doesn't come close to capturing why neural networks work well on images either.
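For intuition, here's a minimal sketch (assuming numpy; everything below is my own illustration, not taken from the UML book) that Monte-Carlo-estimates the empirical Rademacher complexity of unit-norm linear predictors on a fixed sample. For that particular class the supremum over hypotheses has a closed form, ||(1/m) sum_i sigma_i x_i||, which the code uses directly:

    # Empirical Rademacher complexity of {x -> <w, x> : ||w|| <= 1} on a
    # fixed sample X, estimated by averaging over random sign vectors.
    # Assumes numpy; sample size and dimension are arbitrary choices.
    import numpy as np

    def empirical_rademacher_linear(X, n_draws=2000, seed=0):
        rng = np.random.default_rng(seed)
        m = X.shape[0]
        total = 0.0
        for _ in range(n_draws):
            sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
            # sup_{||w||<=1} (1/m) sum_i sigma_i <w, x_i> = ||(1/m) sum_i sigma_i x_i||
            total += np.linalg.norm(X.T @ sigma) / m
        return total / n_draws

    X = np.random.default_rng(1).normal(size=(500, 20))   # toy sample
    print(empirical_rademacher_linear(X))                 # decays like 1/sqrt(m)

The point of the definition is that, unlike VC dimension, it is measured on the actual sample, so a benign data distribution can make it much smaller than the worst case - though, as said above, even that doesn't explain neural nets on images.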
No, it doesn't even guarantee an upper bound on the number of hidden units required to approximate a given function to a given accuracy (even when that function can be learned). So its usefulness is quite limited.