hmmm... For decades, researchers and experts have said that we really don't understand what the heck is going on inside these artificial neural networks...
....Sounds like the decades-long quest to understand how a network of artificial neurons processes input data into useful output (for example: categorizing, deciding yes or no, or controlling the actuators of robots) ...has been cracked. That is, by drawing on linguistic concepts like 'tokens' and 'parsing', the researchers have been able to trace through the steps of what the FFNs are logically doing.
[Regarding other commenters' comments, two points to keep in mind: 1) any trained ANN (in machine learning generally, FFNs included) can be reduced to a single mathematical formula that replicates the processing done by the network. In other words, after training, the result can be expressed as one piece of math that can be reused to replicate the trained net WITHOUT running the actual ANN anymore - the ANN is only needed during training to derive a definite mathematical result for future use. {See most graduate-level textbooks on machine learning for the details.} 2) "sparse" does not equal "unnecessary"; as others have suggested, it sounds like a decision tree rather than a tangle of connections between artificial neurons doing heaven only knows what logically.]
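(A minimal sketch of point 1, with made-up stand-in weights: once trained, the whole net is just a closed-form expression you can evaluate with plain NumPy, no framework required.)

```python
import numpy as np

# Hypothetical stand-ins for weights exported from a trained 2-layer feed-forward net.
W1, b1 = np.random.randn(6, 4), np.zeros(6)
W2, b2 = np.random.randn(1, 6), np.zeros(1)

def trained_net(x):
    """The entire trained model is just this formula: y = W2 * relu(W1 * x + b1) + b2."""
    h = np.maximum(0, W1 @ x + b1)   # hidden layer with ReLU
    return W2 @ h + b2               # output layer

print(trained_net(np.array([0.1, 0.2, 0.3, 0.4])))
```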
Most models have a purely fixed architecture: i.e. if you train three layers of six feed-forward neurons apiece, your model will just fit the training data within that architecture to the extent that gradient descent can force it to fit. There is no mechanism to say "oh, these neurons will never activate, let's prune them", or "we'd get much better loss if we added a layer here".
In the dark times before PyTorch there was an idea called NEAT: "neuroevolution of augmenting topologies", which tried to use a genetic algorithm (i.e. testing a bunch of slightly modified solutions for loss) to discover both optimal weights and optimal network structure. I don't think this gets used all that often, if at all.
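Very roughly, the idea looks like this (a toy sketch, not the real NEAT algorithm, which also tracks innovation numbers and speciation; the fitness function here is a dummy):

```python
import copy
import random

def evaluate_loss(genome):
    # Placeholder: decode the genome into a network and measure its loss on real data.
    return sum(w * w for w in genome["weights"])

def mutate(genome):
    child = copy.deepcopy(genome)
    if random.random() < 0.8:
        i = random.randrange(len(child["weights"]))
        child["weights"][i] += random.gauss(0, 0.1)    # tweak an existing weight
    else:
        child["weights"].append(random.gauss(0, 1.0))  # structural mutation: grow the genome
    return child

population = [{"weights": [random.gauss(0, 1.0)]} for _ in range(20)]
for generation in range(100):
    population.sort(key=evaluate_loss)                 # lower loss = fitter
    survivors = population[:5]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(15)]
```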
I hear stories every few years about how many models have unused neurons that can be pruned or how hyperparameter selection is a pain in the ass, but nothing about automating the pruning and parameter selection in such a way as to efficiently use the whole model. I'm not sure it's necessary anyway.
DL researcher here. I think it's also hard because many of us have noted experimentally (and there's some research on this) that there's a critical early phase of learning that conditions the network a certain way, and adding and/or removing layers later seems to be quite difficult (removing far easier than adding, especially with something like variational dropout [yes, I cite the old deep magicks]: https://arxiv.org/abs/1701.05369)
Yes, unfortunately I have yet to beat that technique in wallclock convergence speed vs just using the larger network from the start. :'((((
Whoever figures out how to clearly and effectively do it consistently faster than a 'full network from the start' version will open up an entirely new subfield of neural network training efficiency. :'))))
There's a huge amount of work on model pruning, especially with an eye towards model reduction for on-device deployments. I've done a bunch of work in this space focused on getting speech synthesis working in real time on phones. It works and can be automated.
There's a lot of nuance though. What has typically worked best are variants on iterative magnitude weight pruning rather than activation pruning.
This can often get rid of 90% of the weights with minimal impact on quality. Structured pruning lets you remove blocks of weights while retaining good SIMD utilization... and so on.
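For anyone curious, the skeleton of the iterative magnitude-pruning loop in PyTorch looks roughly like this (the model and the per-round fraction are placeholders; a real pipeline fine-tunes between rounds and checks quality after each one):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))  # placeholder model

# Repeat a few times: zero out some of the smallest-magnitude weights, then fine-tune.
layers_to_prune = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
for _ in range(5):
    prune.global_unstructured(
        layers_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=0.3,  # fraction of weights masked in this round
    )
    # ... fine-tune the surviving weights here before the next round ...

# Bake the accumulated masks into the weight tensors once done.
for module, name in layers_to_prune:
    prune.remove(module, name)
```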
Absolutely. Pruning can decrease overall accuracy. The challenge is to prune as aggressively as possible while keeping accuracy as high as possible. It's certainly not an operation that is expected to have no effect.
It should be noted that there are methods for training the network to encourage these prunable neurons, e.g. sparsity penalties.
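(For example, a hypothetical L1 penalty on hidden activations, which nudges many units toward staying silent so they can be pruned later; the model and coefficient here are made up:)

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))  # placeholder
criterion = nn.CrossEntropyLoss()
sparsity_weight = 1e-4  # hypothetical coefficient; tune per task

def loss_with_sparsity(x, y):
    hidden = model[1](model[0](x))   # post-ReLU activations of the hidden layer
    logits = model[2](hidden)
    # Task loss plus a penalty on average activation magnitude.
    return criterion(logits, y) + sparsity_weight * hidden.abs().mean()
```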
If I read the paper correctly, it seems to support the old quip about all AI being decision trees, at least for smaller model sizes.
It also raises an interesting UX question: is there an implicit tradeoff between legibility and power for notations (as with different ways of expressing rotation), or is that a consequence of using graphics hardware to implement AI?
Or maybe one should keep track of "neuron activation frequency" during training. This wouldn't add a lot of extra parameters, since it's per neuron, not per weight.
Then every epoch or so, we'd "reinitialize" the dead neurons.
This is similar to K-Means algorithms that reinitialize cluster centers that have very few assigned points.
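A rough sketch of what that could look like (all names hypothetical; a single hidden layer for brevity):

```python
import torch
import torch.nn as nn

layer = nn.Linear(64, 128)            # placeholder hidden layer
activation_counts = torch.zeros(128)  # one counter per neuron, not per weight

def forward_and_count(x):
    h = torch.relu(layer(x))
    activation_counts.add_((h > 0).any(dim=0).float())  # did each neuron fire in this batch?
    return h

def reinit_dead_neurons(threshold=1):
    """Call once per epoch: re-randomize rows whose neuron (almost) never fired."""
    dead = activation_counts < threshold
    with torch.no_grad():
        layer.weight[dead] = torch.randn(int(dead.sum()), 64) * 0.01
        layer.bias[dead] = 0.0
    activation_counts.zero_()
```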
Perceptrons are, by design, analogous to organic neurones: analogies only, not faithful models. Likewise, the artificial networks are analogies to organic networks.
It's therefore not surprising that the artificial behaves analogously to the organic, but it would be a mistake to assume they reproduce us accurately: GPT-3 is about a thousand times less complex than a human, while being trained on more text than any of us could experience in ten millennia, and only on text. It has no emotions, unless the capacity to simulate affect happened to lead to this as an emergent property by accident (which I can't rule out but don't actually expect to be true).
I find it pretty remarkable how in many visual recognition neural networks (say for the MNIST digits) you see neurons close to the input layer that respond similarly to neurons in the V1 area of the visual cortex.
Sure, they are models. One is math, one is atoms.
But look at a single 'real' neuron: it is just calcium ions and electrical potentials; there is no 'emotion'.
Once you can completely model a 'real' neuron (I know there is still some scale to be reached before achieving this), then link together these exactly modeled 'real' neurons, what is to say the result is not experiencing 'emotions', even though it is silicon?
Humans give themselves too much credit for being special. "I feeeeeel, the computer can't". That is just not understanding yourself.
> But look at a single 'real' neuron: it is just calcium ions and electrical potentials; there is no 'emotion'.
Indeed; this is why I am willing to entertain the possibility that emotions may be present as an emergent property of simulating us. I don't expect it, but I can't rule it out.
You clearly lack understanding here too. Simulating "real" neurons at the scale required to simulate a brain is probably NP-hard. Even if we wanted to try, we don't have any maps of neuron connectivity with nearly the resolution required to do so.
We do for at least one classic, small-scale example: C. elegans. Despite mapping its roughly 300 neurons, the simulation attempts I'm aware of weren't very fruitful.
> with nearly the resolution required to do so.
I agree this may be part of why. Accurate simulation may require replicating subtle behavior outside the neuron body. Further maps or simulation attempts may have since been made, and possibly with better results. Given I don't remember headlines about this, it's likely that any improvements weren't groundbreaking.
I don't know enough about the roles of glia and inter-neuron (not interneuron) behavior to discuss this further beyond wild speculation. Nor does anyone, as far as I know. Gaining that knowledge would probably be necessary to build connectomes with sufficient accuracy for simulation.
I think the bigger problem is we have no idea what the necessary or sufficient requirements are, neither for qualia nor for intelligence. (Not sure why you think it's NP-hard rather than just a lot of computation?)
With intelligence we can at least tell when we've achieved it, to whatever standard we like.
Emotions could probably be a thing where we can map some internal state to some emotional affect display, eventually; but what about any question of emotional qualia?
AFAICT, we don't even have a good grasp of the question yet. We each have access only to the one inside our own head, and minimal capacity even to share what these are like with each other. When did you learn about aphantasia, for example? Is there a word for the equivalent of that for senses other than vision? I can "visualise" sounds and totally override the direction of down coming from my inner ears, but I can't "visualise" smells, and I don't have a non-visually oriented word for the idea, since "visualise" and "imagine" are both clearly visual terms (and "idea" itself more subtly is too).
"Qualia" are really just a reframing of (an aspect of) consciousness, which has been speculated to be purely epiphenomenal. Maybe we're just along for the ride, and our actions merely happen to mirror our decisions - or the other way around, same difference.
Qualia is how to discuss the problem of consciousness without a pointless debate about whether it is the opposite of unconscious, the opposite of subconscious, something that needs a soul, or any of the other things that go wrong if you don't taboo the "c" word.
We don't know the answers (nor, I assert, the correct questions), though "what even is this?" and "what has it?" were already relevant to animal rights questions (qualia might have started with humans, but there's no reason to assume that) well before current AI, and even if we find a convenient proof that current AI definitely can't and that's fine… some of us want to advance the AI to at least as far as brain uploading where the qualia is the goal, though virtual hells like in the novel Surface Detail seem to me to be an extremely plausible dystopian outcome if uploads ever actually happen.
Do you mean biological neurones or perceptrons? There have been publications about both regarding XOR. If the latter, be aware that this was only about single layers and that perceptrons have an unnecessary restriction on the way they can combine inputs.
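(For instance, a tiny two-layer perceptron with weights chosen by hand rather than learned computes XOR just fine; only the single-layer case is restricted:)

```python
import numpy as np

def step(v):
    return (v > 0).astype(int)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: one unit fires for OR, one for AND.
    h = step(np.array([[1, 1], [1, 1]]) @ x + np.array([-0.5, -1.5]))
    # Output: OR minus AND, i.e. "exactly one input is on".
    return int(step(np.array([1, -1]) @ h - 0.5))

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```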
No. Sometime last year there was a paper measuring inputs and outputs on a real biological neuron, which discovered it could do XOR logic.
It can do other logic as well; that it could also do XOR was the new breakthrough.
And I'm not saying neurons are just logic gates, it's just that they are something well below NP-hard.
There are analog fluctuations in potential all along the neuron, and they peak and fire at different thresholds. Really, seems like 'weights' if we want to map terminology.
It does not seem like it will be that long before the human neuron can be 'modeled' accurately enough that quibbling over a few percentage points between the 'model' and 'reality' will not leave much room in-between to find a 'soul' or 'spark', or anything anybody wants to say makes humans special.
> GPT-3 is about a thousand times less complex than a human
Also note that it takes the 'brute force' approach to architecture by using a transformer model, basically learning a connection graph from scratch. If you want that to scale to human complexity in function, you're probably going to have to overshoot in size by an order of magnitude.
> GPT-3 is about a thousand times less complex than a human
How did you figure that? How many synapses are dedicated to text processing in a human brain? How much information do those synapses encode compared to the information encoded in GPT-3? How about GPT-4?
Estimates for how much information the brain encodes are several orders of magnitude higher than for the biggest LLMs, to the point where trying to replicate it pushes the boundaries of what is computationally feasible. The brain is also significantly more adaptable and generalized thanks to neuroplasticity.
I see, this is basically the "AI of the gaps" argument. :P
It is NOT "only scale" at this point. But at the current rate, you will see this soon enough (I just happen to see it already). We'll have some very intelligent-seeming, very useful, but relatively uncreative zombies missing a "spark" (or any willpower, or any real source of what is valuable to them, or any sense of what is aesthetically pleasing or preferable or enjoyable or beautiful, or any consciousness for that matter). It will allow us to redefine what it means to be human. But our distinct human-ness will stand out even more at that point.
Agency is a key part of that spark, but we have done all sorts of research into agency, and I think providing goal-based agents within an AI framework that incorporates LLMs as well as other optimizers and solvers will supply the majority of that spark. The process of creativity depends on internal agency, goal setting unmoored from external dictation, and semantic synthesis of abductively reasoned concepts, with an aesthetic sense that feeds into the goal-based optimizer. These are things that can be simulated to the point that, while there may be an uncanny valley somewhere, it’ll be close enough to be hard to distinguish.
But I do wonder if the practical utility of such an entity is worth the amount of effort and capital required to build and sustain it. I suspect it’ll be more a novelty than a practical tool.
I would point to the problem that chatbots fail not at having a “spark” but at things ordinary computer software does well. The other day somebody pointed out in an HN conversation that I had confused the 1984 Super Bowl with the 1986 Super Bowl.
That’s a very human mistake. I’m sure somebody can tell you who played in every Super Bowl and what the score was, but people misremember things frequently and we don’t call it a hallucination (which is a defect in perception).
“Superhuman intelligence” is easy to realize for sports statistics if you do the ontology and data-entry work and put the data in a relational (or similar) database.
The thing is that chatbots get 90% accuracy in cases where you can get 99.99% accuracy (sometimes the data entry is wrong) with conventional technology. There is a kind of faith that we can go to 10^17 or 10^30 parameters or something and at some point perfect performance will “emerge”, but no, I think it is more like it will approach some asymptote, say 95%, and you will try harder and harder and it will be like pushing a bubble around under a rug. It’s a common situation in failing technology projects, quite well documented in
but boy people are seduced by those situations and have a hard time recognizing that they are in them.
In a certain sense chatbots already have superhuman powers of seduction that, I think, come from not having a “self” which makes mirroring easier to attain. People wouldn’t ordinarily be impressed by a computer program that can sort a list of numbers 90% correctly but give it the ability to apologize and many people will think it is really sincere and think it is really promising, it just needs a few trillion more transistors. (See the story Trurl’s Machine in Stanislaw Lem’s excellent Cyberiad except that machine is belligerent and not agreeable)
Now an obvious path is to have the chatbot turn a question into a SQL query and then feed the results into conversation and that’s a great idea and an active research area, but I’d point out the dialogues between Achilles and the Tortoise in
which people mistakenly think are about symbolic A.I., but which are really about the problem of solving problems where the correct solution has a logical aspect. Even though logic isn’t everything, the formulation of most problems (like “Who won the soccer game at Cornell last night?”) is fundamentally logical and leads you straight to paradoxes that can have you forever pushing a bubble under the rug and thinking “just one more” little hack will fix it…
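The shape of that SQL path, very roughly (a sketch only: `ask_llm` is a placeholder for whatever chat-completion API you use, and the schema is invented):

```python
import sqlite3

def ask_llm(prompt):
    """Placeholder for whatever chat-completion API you call."""
    raise NotImplementedError

def answer_with_sql(question, db_path="sports.db"):
    schema = "CREATE TABLE games(season INT, home TEXT, away TEXT, home_score INT, away_score INT);"
    # Step 1: have the model write a query against a known schema.
    sql = ask_llm(f"Schema:\n{schema}\nWrite one SQLite query answering: {question}\nSQL only.")
    # Step 2: the database, not the model, is the source of truth for the facts.
    rows = sqlite3.connect(db_path).execute(sql).fetchall()
    # Step 3: feed the results back into the conversation.
    return ask_llm(f"Question: {question}\nQuery results: {rows}\nAnswer in one sentence.")
```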
LLMs are just one tool in a collection. Intelligence is based on many models, not just the language parts of our brain, and I expect AI to incorporate more models in a systems approach. Why does it matter if LLMs can play chess at a grandmaster level or not? They can delegate the actual chess optimization problem to a chess-optimizing program. While it’s interesting that language alone is as powerful as it is, it’s very myopic to judge the tool alone rather than as part of a toolbox.
Exactly
It is NOT all about LLMs.
There are a lot of other successful models.
From AlphaGo to vision systems to robotics.
LLM is just the latest shiny thing.
At some point they will all be tied together, and at that point it will start to look a lot more like sections of our brain: one for vision, one for language, one for movement, etc.
I think it's already been made clear that the main reason for the "asymptote" is wrong data input. These models attempt to learn from random internet text ... and this turns out to not be all that accurate.
Also, I've observed a model I was training having the same problem as I do myself. If I at any point learn wrong data, which happens of course, then getting that wrong data back out is very hard and requires 10 or 50 times the effort I spent learning the initial data. In fact I strongly suspect I never unlearn bad data, I just additionally learn "if I say X, it's wrong, say Y instead".
Brains suck at exact work such as database work or precise calculations over longer chains. But they excel at approximate work, and that's a very useful skill to have as long as, when you need to, you can fire up pencil and paper and do your precise calculations that way. And paper works fine for database work as well and will remember all of those sports stats for as long as you care (and even after you're dead).
Brains are so powerful because they are universal, they can use auxiliary data stores and co-processors just fine.
So basically we have to give the LLM access to (both read from and add to) a tool that can deal with structured knowledge/state strictly, the same thing we have to do for humans: calculators, databases, clocks/alarms, code executors… That way if we tell it “remember that my birthday is April 5” it can enter it into a calendar tool in such a way that it can quickly retrieve it later to confirm its “LLM guesswork”, or trigger a reminder on that date.
I’ve been experimenting with prompting to get GPT4 to realize it has a “memory” (just a flat file for now) which it can contextually retrieve and write to, coupled with a process that interprets any requests it makes of this “memory” and adds them to the conversation. Limited success so far. End goal is a “life agent” that does things like remind me of things in a human-like way, sums up my emails, etc.
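Roughly the kind of glue process I mean (a sketch; the MEMORY[...] convention and file name are illustrative placeholders, not a GPT-4 feature):

```python
import re

MEMORY_FILE = "memory.txt"   # the flat-file "memory"

def handle_model_output(text):
    """Scan the model's reply for memory requests and act on them."""
    # Anything the model wraps in MEMORY[WRITE:...] gets appended to the file.
    for fact in re.findall(r"MEMORY\[WRITE:(.*?)\]", text):
        with open(MEMORY_FILE, "a") as f:
            f.write(fact.strip() + "\n")
    # If the model asks to read, return the file contents to feed into the next prompt.
    if "MEMORY[READ]" in text:
        with open(MEMORY_FILE) as f:
            return "Stored memory:\n" + f.read()
    return None
```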
~jgroch