Hard disagree. Phones and computers do text/images/video. Voice input and audio output are a poor substitute for text and not at all a replacement for images/video.
I didn't say there wouldn't be a screen. But natural conversations will be much more fluid and efficient than a keyboard and a Google search. Having a conversation is so much better for all sorts of applications.
Today's smartphones have to evolve this way, imo. I don't know what the most efficient hardware realization would look like but I imagine it's something that isn't in your pocket most of the time, more of a sleek wearable. It will need to be able to hear and see what you do.
I've been testing Android's dictation stuff while speaking quietly and it works better than I expected. (That is, the mistakes it makes seem to be the same ones it would make if I were to talk louder: some common repeating misunderstandings, some things where it's thrown off by my accent.) Having the mic close by makes recording a quiet voice much more tractable than hearing it in conversation; we humans tend to stand further apart, and mics are better than ears by now.
Not that I'd be advocating this for library study halls!