Have not read the paper in much depth yet but this looks like great work, super interesting. Thanks for sharing.
Question: in the example of prediction on untrained tasks, what exactly hasn't been trained? The paper talks about video being one of the trained tasks. Did you simply retrain the model without video examples and then test performance?
The model was trained on video classification, image QA, and image captioning. Video captioning and video QA were not trained, yet the model shows results on those tasks.
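To make the zero-shot part concrete, here's a rough sketch of why a task combination that was never trained can still produce output: the video encoder gets trained through video classification, the caption decoder through image captioning, and because both operate in the same shared representation space, you can compose them at test time. This is not the actual OmniNet code; the module names (VideoEncoder, CaptionDecoder), the dimensions, and the mean-pooling are all simplified assumptions for illustration.

```python
# Illustrative sketch only, not the OmniNet implementation.
import torch
import torch.nn as nn

DIM = 256  # shared representation size (arbitrary for this sketch)

class VideoEncoder(nn.Module):
    """Maps video frames into the shared representation space.
    Its weights would be learned via the *video classification* task."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2048, DIM)  # assume 2048-d per-frame features

    def forward(self, frames):            # frames: (T, 2048)
        return self.proj(frames).mean(0)  # pooled (DIM,) video embedding

class CaptionDecoder(nn.Module):
    """Generates text from a shared embedding.
    Its weights would be learned via the *image captioning* task."""
    def __init__(self, vocab=10000):
        super().__init__()
        self.out = nn.Linear(DIM, vocab)

    def forward(self, shared_embedding):
        return self.out(shared_embedding)  # next-token logits over the vocab

# Trained pairings: video encoder -> classifier, image encoder -> caption decoder.
# Untrained pairing, composed at test time:
video_enc, cap_dec = VideoEncoder(), CaptionDecoder()
logits = cap_dec(video_enc(torch.randn(16, 2048)))  # "video captioning" without training it
```

The point of the sketch is just the composition in the last line: neither module ever saw a video captioning example, but they share an embedding space, so the pairing still produces (reasonable) output.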
Can you explain in layman's terms what exactly has been done here? What I understood is that a single NN is trained for multiple tasks, but what is the benefit?
As an author of this paper, I feel such neural networks are indeed a small step towards what we call AGI. Learning shared representations across a variety of tasks makes an AI system more robust to real-world datasets and makes it easy to adapt to new tasks without having to learn everything from scratch (something we humans are naturally capable of doing).
The original title is preferred when submitting links: "Show HN: OmniNet - A unified architecture for multi-modal multi-task learning"
HN is a bit strict about titles. I'd say "X is all you need" gets less attention from users than a very technical headline. The most popular recent submission had MITM in the title (https://news.ycombinator.com/best).