I've often wondered about the theoretical limits of a neural net's ability to learn from examples - it seems like a fascinating subject with lots of implications. From a quick search, I found this paper:
https://experts.illinois.edu/en/publications/computational-l...
which is already very interesting. Are there any other good pointers on this?
According to the universal approximation theorem [1], a multilayer feedforward net with a single hidden layer can approximate any continuous function on a compact subset of R^n to arbitrary accuracy. There's also a nice visual proof. [2]
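To see the theorem in action (not a proof, just an illustration), here's a minimal sketch assuming numpy and scikit-learn are installed; the hidden-layer width of 200 and the target sin(3x) are my own arbitrary choices:

    # Illustration of universal approximation: fit a single-hidden-layer net
    # to a continuous function on a compact interval.
    # Assumes numpy + scikit-learn; width and target are arbitrary choices.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(-np.pi, np.pi, size=(2000, 1))   # compact domain [-pi, pi]
    y = np.sin(3 * x).ravel()                        # target continuous function

    net = MLPRegressor(hidden_layer_sizes=(200,),    # one hidden layer, width 200
                       activation='tanh', max_iter=5000, tol=1e-6)
    net.fit(x, y)

    x_test = np.linspace(-np.pi, np.pi, 500).reshape(-1, 1)
    err = np.max(np.abs(net.predict(x_test) - np.sin(3 * x_test).ravel()))
    print(f"max abs error on the domain: {err:.3f}")

Widening the hidden layer generally drives the error down, which is the existence the theorem promises; whether backpropagation actually finds that approximation is a separate question.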
Are there also theorems about which functions can be learned, or how quickly they can be learned, by a multilayer feedforward neural net trained with backpropagation?
The short story is that VC dimension describes the "worst case" generalization behavior of a hypothesis class. That makes it a good predictor of performance on genuinely hard problems, but many modern problems turn out to be theoretically easy. For example, neural networks trained on images generalize much better than the VC dimension of neural networks would suggest. This is because the distribution of natural images is closer to a best-case scenario than a worst-case scenario.
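To make "worst case" concrete: a standard uniform-convergence bound based on VC dimension (stated up to constants and log factors, so treat the exact form as a sketch) says that with probability at least 1 - delta over an i.i.d. sample of size m, every hypothesis h in a class of VC dimension d satisfies

    L_D(h) \le L_S(h) + O\left( \sqrt{ \frac{d + \log(1/\delta)}{m} } \right)

where L_D(h) is the true error and L_S(h) the error on the sample. The right-hand side depends only on the hypothesis class, never on the data distribution, which is exactly why it is so pessimistic for neural nets on natural images.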
Lots of research is being done to determine when we are in a best-case scenario. The "simplest" such measure is the Rademacher complexity (see Chapter 26 of the UML book). "Simplest" is in quotes because most PhD-level machine learning students I've met don't understand what Rademacher complexity is unless they specialize in theory. And Rademacher complexity doesn't come close to capturing why neural networks work well on images either.
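For intuition, here's a minimal sketch (assuming numpy; everything below is my own illustration, not taken from the UML book) that Monte-Carlo-estimates the empirical Rademacher complexity of unit-norm linear predictors on a fixed sample. For that particular class the supremum over hypotheses has a closed form, ||(1/m) sum_i sigma_i x_i||, which the code uses directly:

    # Empirical Rademacher complexity of {x -> <w, x> : ||w|| <= 1} on a
    # fixed sample X, estimated by averaging over random sign vectors.
    # Assumes numpy; sample size and dimension are arbitrary choices.
    import numpy as np

    def empirical_rademacher_linear(X, n_draws=2000, seed=0):
        rng = np.random.default_rng(seed)
        m = X.shape[0]
        total = 0.0
        for _ in range(n_draws):
            sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
            # sup_{||w||<=1} (1/m) sum_i sigma_i <w, x_i> = ||(1/m) sum_i sigma_i x_i||
            total += np.linalg.norm(X.T @ sigma) / m
        return total / n_draws

    X = np.random.default_rng(1).normal(size=(500, 20))   # toy sample
    print(empirical_rademacher_linear(X))                 # decays like 1/sqrt(m)

The point of the definition is that, unlike VC dimension, it is measured on the actual sample, so a benign data distribution can make it much smaller than the worst case - though, as said above, even that doesn't explain neural nets on images.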
No, it doesn't even guarantee an upper bound on the number of hidden units required to approximate a given function to a given accuracy (even when that function can be learned). So its usefulness is quite limited.