
Over the last few years I’ve become exceedingly aware of how insufficient language really is. It feels like a 2D plane, and no matter how many projections you attempt to create from it, they are ultimately limited in the fidelity of the information transfer.

Just a lay opinion here, but to me each mode of input creates a new, largely orthogonal dimension for the network to grow into. The experience of your heel slipping on a cold sidewalk can be explained in a clinical fashion, but an android’s association of that with the powerful dynamic response required to even attempt to recover will give a newfound association and power to the word ‘slip’.



This exactly describes my intuition as well. Language is limited by its representation, and we have to jam so many bits of information into one dimension of text. It works well enough to have a functioning society, but it’s not very precise.


LLM is just the name. You can encode anything into the "language", including pictures, video, and sound.


I've always wondered whether anyone is working on using nerve impulses. My first thought when transformers came around was whether they could be used for prosthetics, but I've been too lazy to do the research to find anyone working on anything like that, or to experiment with it myself.


There are a few folks working on this in neuroscience, e.g. training transformers to "decode" neural activity (https://arxiv.org/abs/2310.16046). It's still pretty new and a bit unclear what the most promising path forward is, but will be interesting to see where things go. One challenge that gets brought up a lot is that neuroscience data is often high-dimensional and with limited samples (since it's traditionally been quite expensive to record neurons for extended periods), which is a fairly different regime from the very large data sets typically used to train LLMs, etc.
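
To make that concrete, here's roughly what "transformer over neural data" tends to look like in that line of work: binned spike counts go in as a sequence, a decoded behavioral variable comes out. All names, layer sizes, and shapes below are invented for illustration; this is not the architecture from the linked paper.

  import torch
  import torch.nn as nn

  class SpikeDecoder(nn.Module):
      # Toy decoder: a sequence of binned spike counts -> a behavioral variable
      # (e.g. 2D cursor velocity). Hypothetical sizes, not from the cited paper.
      def __init__(self, n_neurons=96, d_model=128, n_out=2):
          super().__init__()
          self.embed = nn.Linear(n_neurons, d_model)   # each time bin becomes a "token"
          layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
          self.encoder = nn.TransformerEncoder(layer, num_layers=2)
          self.readout = nn.Linear(d_model, n_out)

      def forward(self, spikes):                       # spikes: (batch, time_bins, n_neurons)
          x = self.embed(spikes)
          x = self.encoder(x)
          return self.readout(x)                       # (batch, time_bins, n_out)

  model = SpikeDecoder()
  fake_spikes = torch.randn(8, 50, 96)                 # 8 trials, 50 bins, 96 channels
  print(model(fake_spikes).shape)                      # torch.Size([8, 50, 2])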


There are ‘spiking neural networks’ that operate in a manner that more closely emulates how neurons communicate. One idea I find interesting is that we could build a neural network that operates in a way that is effectively ‘native’ to our mind, so it’s less like there’s a hidden keyboard and screen in your brain and more that it simply becomes new space you can explore in your mind.

Or learn kung fu.
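
More seriously, for anyone curious what ‘spiking’ means concretely: the usual starting point is the leaky integrate-and-fire neuron, where units communicate through discrete all-or-nothing events rather than continuous activations. A toy sketch with arbitrary parameters:

  import numpy as np

  def lif_neuron(input_current, dt=1.0, tau=20.0, v_rest=0.0, v_thresh=1.0, v_reset=0.0):
      # Leaky integrate-and-fire: the membrane voltage leaks toward rest,
      # integrates input current, and emits an all-or-nothing spike at threshold.
      v = v_rest
      spikes = []
      for i_t in input_current:
          v += dt * (-(v - v_rest) + i_t) / tau   # leaky integration step
          if v >= v_thresh:
              spikes.append(1)                    # communicate with a discrete spike
              v = v_reset                         # reset after firing
          else:
              spikes.append(0)
      return spikes

  rng = np.random.default_rng(0)
  train = lif_neuron(rng.uniform(0.5, 2.0, size=200))
  print(sum(train), "spikes out of", len(train), "time steps")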


Like Cortical Labs? Neurons integrated on a silicon chip: https://corticallabs.com/cl1.html


When you train a neural net for Donkeycar with camera images plus the joystick commands of the driver, isn't that close to nerve impulses already?
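
For context, that setup is behavioral cloning: supervised learning from camera frames to the driver's control outputs. A rough sketch of that kind of model (illustrative only, not Donkeycar's actual code or API):

  import torch
  import torch.nn as nn

  # Toy behavioral-cloning policy: camera frame in, steering/throttle out.
  # Layer sizes are made up; Donkeycar's real pipeline differs.
  class DrivePolicy(nn.Module):
      def __init__(self):
          super().__init__()
          self.cnn = nn.Sequential(
              nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
              nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
              nn.Flatten(),
          )
          self.head = nn.Sequential(
              nn.LazyLinear(64), nn.ReLU(),
              nn.Linear(64, 2),             # [steering, throttle]
          )

      def forward(self, frame):             # frame: (batch, 3, H, W)
          return self.head(self.cnn(frame))

  policy = DrivePolicy()
  frames = torch.randn(4, 3, 120, 160)      # fake camera frames
  targets = torch.randn(4, 2)               # recorded joystick commands
  loss = nn.functional.mse_loss(policy(frames), targets)
  loss.backward()                           # supervised "clone the driver" update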


> I've always been wondering if anyone is working on using nerve impulses. My first thought when transformers came around was if they could be used for prosthetics

Neuralink. Musk warning though.

For reference, see the Neuralink Launch Event at 59:33 [0], and continue watching until Musk takes over again. The technical information there is highly relevant to a multi-modal AI model with sensory input/output.

https://youtu.be/r-vbh3t7WVI?t=3575


> You can encode anything into the "language"

I'm just a layman here, but I don't think this is true. Language is an abstraction, an interpretive mechanism of reality. A reproduction of reality, like a picture, by definition holds more information than its abstraction does.


I think his point is that LLMs are pre-trained transformers, and pre-trained transformers are general sequence predictors. Those sequences started out as text or language only, but by no means is the architecture constrained to text or language alone. You can train a transformer that embeds and predicts sound and images as well as text.
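
A hand-wavy sketch of that idea: each modality gets its own embedder into a shared token space, and a single transformer predicts over the combined sequence. Everything below (sizes, tokenizers, the omitted causal mask) is simplified for illustration:

  import torch
  import torch.nn as nn

  D = 256                                    # shared embedding dimension

  # Modality-specific embedders into one shared token space (toy sizes).
  text_embed  = nn.Embedding(32000, D)       # text token ids -> vectors
  image_embed = nn.Linear(16 * 16 * 3, D)    # flattened 16x16 image patches -> vectors
  audio_embed = nn.Linear(400, D)            # short audio frames -> vectors

  backbone = nn.TransformerEncoder(
      nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=4
  )                                          # (causal masking omitted for brevity)
  next_token_head = nn.Linear(D, 32000)      # predict the next token in the mixed stream

  # One interleaved sequence: some text tokens, then image patches, then audio frames.
  text  = text_embed(torch.randint(0, 32000, (1, 10)))
  image = image_embed(torch.randn(1, 20, 16 * 16 * 3))
  audio = audio_embed(torch.randn(1, 15, 400))
  sequence = torch.cat([text, image, audio], dim=1)    # (1, 45, D)

  logits = next_token_head(backbone(sequence))         # same predictor, any modality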


A picture is also an abstraction. If you take a picture of a tree, you have more details than the word "tree". What I think the parent is saying is that all the information in a picture of a tree can be encoded in language, for example a description of a tree, using words. Both are abstractions, but if you describe the tree well enough with text (and comprehend the description) it might have the same "value" as a picture (not for a human, but for a machine). Also, the size of the text describing the tree might be smaller than the picture.


> all the information in a picture of a tree can be encoded in language

What words would you write that would identify this tree as uniquely as a picture would, distinguishing it from any other tree in the world?

Now repeat for everything in the picture, like the time of day, weather, dirt on the ground, etc.


Great, but how do you imagine multimodality working with, say, text and video? Just two modalities for simplicity: what would be in the training set? With text, the model tries to predict the next token, and then more training steps were added on top of that. But what do you do with multimodal data?
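
One common recipe (just a sketch of the general idea, not how any particular model does it): tokenize the video frames too, interleave them with the text tokens in source order, and keep the same next-token objective over the whole mixed sequence. Hypothetical token ids and helpers below:

  # Hypothetical special token ids, just for illustration.
  BOS, EOS, FRAME_START = 1, 2, 3

  def tokenize_frame(frame):
      # Stand-in for a real image tokenizer (e.g. VQ-VAE codes); here: fake ids.
      return [10, 11, 12, 13]

  def build_example(caption_tokens, video_frames):
      # One training example: caption tokens interleaved with per-frame tokens,
      # trained with the usual next-token objective over the whole mixed sequence.
      sequence = [BOS] + list(caption_tokens)
      for frame in video_frames:
          sequence += [FRAME_START] + tokenize_frame(frame)
      sequence.append(EOS)
      inputs, targets = sequence[:-1], sequence[1:]   # shift-by-one, as with text-only LMs
      return inputs, targets

  inputs, targets = build_example([101, 102, 103], ["frame0", "frame1"])
  print(inputs)
  print(targets)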




