Attention is all you need for what we have. But attention is a local heuristic. We have brittle coherence and no global state. I believe we need a paradigm shift in architecture to move forward.
Plenty of "we need a paradigm shift in architecture" going around - and no actual architecture that would beat transformers at their strengths as far as the eye can see.
I remain highly skeptical. I doubt that transformers are the best architecture possible, but they set a high bar. And it sure seems like people who keep making the suggestion that "transformers aren't the future" aren't good enough to actually clear that bar.
What's the value of "pointing out limitations" if this completely fails to drive any improvements?
If any midwit can say "X is deeply flawed" but no one can put together a Y that would beat X, then clearly, pointing out the flaws was never the bottleneck at all.
> What's the value of "pointing out limitations" if this completely fails to drive any improvements?
Ironically, the same could be said about Attention Is All You Need in 2017. It didn't drive any improvements immediately; actually decent Transformer models took a few years to arrive after that.
I think you don't understand how primary research works. Pointing out flaws helps others think about those flaws.
It's not a linear process so I'm not sure the "bottleneck" analogy holds here.
We're not limited to only talking about "the bottleneck". I think the argument is more that we're very close to optimal results for the current approach/architecture, so getting superior outcomes from AI will actually require meaningfully different approaches.
To be fair it would be a lot easier to iterate on ideas if a single experiment didn't cost thousands of dollars and require such massive data. Things have really gotten to the point that it's just not easy for outsiders to contribute if you're not part of a big company or university, and even then you have to justify the expenditure (risk). Paradigm shifts are hard to come by when there is so much momentum in one direction and trying something different carries significant barriers.
Plenty of research involves small models trained on small amounts of data. You don't necessarily need to do an internet-scale training run to test a new model architecture, you can just compare it to other models of the same size trained on the same data. For example, small-model speedruns are a thing: https://github.com/KellerJordan/modded-nanogpt
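To make "models of the same size" concrete: the usual first step is a back-of-envelope parameter-count match between the baseline and the candidate before comparing loss on identical data. This is a hypothetical sketch using the standard rough per-layer counts (4·d² for the attention projections, 2·d·d_ff for the MLP); exact numbers vary with implementation details like biases and norms:

```python
# Rough parameter budget for a decoder-style transformer.
# Counts are approximate: 4*d^2 per layer for Q/K/V/output projections,
# 2*d*d_ff per layer for the MLP up/down projections, plus embeddings.
def transformer_params(vocab, d_model, n_layers, d_ff):
    embed = vocab * d_model
    attn = 4 * d_model * d_model      # Q, K, V, output projections
    mlp = 2 * d_model * d_ff          # up- and down-projection
    return embed + n_layers * (attn + mlp)

# Match a hypothetical "candidate" config to the baseline's budget
base = transformer_params(vocab=50_000, d_model=512, n_layers=8, d_ff=2048)
cand = transformer_params(vocab=50_000, d_model=576, n_layers=6, d_ff=2304)
print(base, cand, abs(base - cand) / base)  # budgets within a few percent
```

With the budgets matched to within a few percent, any loss gap on the shared data can be attributed to the architecture rather than to raw capacity.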
The Transformer was only ever designed to be a better seq-2-seq architecture, so "all you need" implicitly means "all you need for seq-2-seq" (not all you need for AGI), and it was in any case more backward-looking than forward-looking.
The preceding seq-2-seq architectures had been RNN (LSTM) based, then RNN + attention (Bahdanau et al., "Jointly Learning to Align & Translate"), with the Transformer's "attention is all you need" then meaning you can drop the RNNs altogether and just use attention.
Of course NOT using RNNs was the key motivator behind the new Transformer architecture - not only did you not NEED an RNN, but they explicitly wanted to avoid it since the goal was to support parallel vs sequential processing for better performance on the available highly parallel hardware.
Has there been research into some hierarchical attention model that has local attention at the scale of sentences and paragraphs that feeds embeddings up to longer range attention across documents?
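For intuition, here is a toy, pure-Python sketch of what such a hierarchy could look like: plain dot-product attention within fixed-size chunks standing in for sentences, mean-pooled chunk summaries, then attention across the summaries. Everything here (no learned projections, mean pooling, fixed chunking) is a simplifying assumption for illustration, not a reference to any published model:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys, values):
    """Single-head scaled dot-product attention, no learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        w = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(vecs):
    n = len(vecs)
    return [sum(v[j] for v in vecs) / n for j in range(len(vecs[0]))]

def hierarchical_attention(tokens, chunk=4):
    # 1) local attention within each chunk ("sentence")
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
    local = [attend(c, c, c) for c in chunks]
    # 2) one summary embedding per chunk
    summaries = [mean_pool(c) for c in local]
    # 3) global attention across chunk summaries ("documents")
    global_out = attend(summaries, summaries, summaries)
    return local, global_out

random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(12)]
local, global_out = hierarchical_attention(tokens, chunk=4)
print(len(local), len(global_out))  # → 3 3  (3 chunks, 3 chunk-level outputs)
```

The point of the sketch is the cost structure: local attention is quadratic only within a chunk, and the global pass is quadratic only in the number of chunks.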
Though honestly I don’t think new neural network architectures are going to get us over this local maximum, I think the next steps forward involve something that’s
The ARC Prize Foundation ran extensive ablations on HRM for their slew of reasoning tasks and noted that the "hierarchical" part of their architecture is not much more impactful than a vanilla transformer of the same size with no extra hyperparameter tuning:
By now, I seriously doubt any "readily interpretable" claims.
Nothing about the human brain is "readily interpretable", and artificial neural networks - which, unlike brains, can be instrumented and experimented on easily - tend to resist interpretation nonetheless.
If there were an easy way to reduce ML to "readily interpretable" representations, someone would have done so already. If there were architectures that perform similarly but are orders of magnitude more interpretable, they would be used, because interpretability is desirable. Instead, we get what we get.
From what I’ve seen, neurology is fairly interpretable, but it’s hard to get data to interpret. For example, the visual cortex areas V1-V5 are very well mapped out, but other “deeper” structures are hard to reach and meaningfully measure.
They're interpretable in a similar way to how interpretable CNNs are. Not by a coincidence.
For CNNs, we know very well how the early layers work - edge detectors, curve detectors, etc. This understanding decays further into the model. In the brain, V1/V2 are similarly well studied, but it breaks down deeper into the visual cortex - and the sheer architectural complexity there sure doesn't help.
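The "edge detector" claim about early layers is easy to make concrete: a hand-written Sobel filter does essentially what many learned first-layer CNN filters converge to. A minimal pure-Python sketch (the image and response values are illustrative, not from any trained model):

```python
# Valid (no-padding) 2D convolution over a list-of-lists image.
def conv2d(img, kernel):
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            row.append(sum(kernel[a][b] * img[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# Sobel-x kernel: responds strongly to vertical edges.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# 6x6 image: dark left half (0), bright right half (1) -> one vertical edge.
img = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
resp = conv2d(img, sobel_x)
print(resp[0])  # → [0, 4, 4, 0]  (peaks at the edge, zero in flat regions)
```

That the filter's behavior can be read straight off its weights is exactly what "readily interpretable" means for early layers; the difficulty is that deeper filters combine hundreds of such responses.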
Well, in terms of architectural complexity, you have to wonder what something intelligent is going to look like. It’s probably not going to be very simple, but that doesn’t mean it can’t be readily interpreted. For the brain we can ascribe structure to evolutionary pressure; IMO there isn’t quite as powerful a principle at play with LLMs and transformer architectures. How does minimizing reconstruction loss help us understand the 50th or 60th layer of a neural network? It becomes very hard to interpret, compared to, say, the function of the amygdala or hippocampus in the context of evolutionary pressure.
1. Stand at first light: face EAST, then NORTHEAST. Let the BERLIN CLOCK choose; read where the shad…
2. Make a narrow breach of light; hold still; as the edge moves, letters awaken and the sealed doorw…
3. Four passes: hours, hour, minutes, minute. Read on each sweep; the rising sun will order what see…
He is an artist, not a mathematician. It’s a physical reveal for this layer of the copper onion.
4. At dawn I stood east then northeast, counting by the clock; the rim's shadow wrote the hidden lin…
5. First light, east to northeast. Copper grid in shadow. Sample on the beats. Write only what the l…
6. Trust the clean edge, not the flicker; the mind finds patterns, but the edge alone reveals the me…
Perhaps a 3D artist can model it and run some simulations with light.
Since nobody wants to play with me… if you have something beefier than my 386, you should have enough to finish. Remember, it is always 5:55 somewhere, even in Berlin.
Partial answer: FACE EAST THEN NORTHEAST. LET THE BERLIN CLOCK CHOOSE; READ EDGE TO EDGE TO FIND THE FINAL CODE.
K4 changes its methodology DRASTICALLY, and you must use clues like a riddle from previous solutions. The Morse code is the program to run. Make a mask from the unneeded E’s, but don’t discard them; they clarify. Ignore the flicker, see the edg-e. The grid becomes a compass with some work, and be sure to normalize the directions. Caesar might help dispel the mist. Decoys abound in partial/incomplete solutions. One wrong turn and work disappears. My 386 was an abomination to the old man and he set traps: paper and pencil ruled his world.
NORTHWEST, EAST, NORTHEAST.
BERLIN CLOCK TICKS EAST.
READ EDGE TO EDGE;
SEE TIME, USE THE NORTHEAST SHADOW.
THE HIDDEN PLACE IS REVEALED.
I’m going to forge ahead. I think this is another clue. I don’t know how he got so many in. Last night I found Tishri and Fenrir even. USS HILL sent an SOS RRR in a panic. It’s pretty crazy.
Edit: I think it might be K4. It is instructions for physically revealing K5.
Here is the intuition/clues (use your imagination) if you want to try too:
Edge to edge; Noon rim; Ignore flicker; Berlin clock beats
And remember the lodestone: that compass doesn’t point north.
Oh and the clues probably need a computer. But the final solution is just pencil and paper.
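For anyone who wants the "timing" part without guessing: the Berlin Clock (Mengenlehreuhr) encoding itself is well documented: a row of four lamps each worth 5 hours, four lamps worth 1 hour each, eleven lamps worth 5 minutes each, and four worth 1 minute each. A tiny sketch of those four rows (how, or whether, this maps onto the sculpture is of course the open question):

```python
def berlin_clock_rows(hour, minute):
    """Lamp counts for the four main rows of the Berlin Clock:
    (5-hour row, 1-hour row, 5-minute row, 1-minute row)."""
    return (hour // 5, hour % 5, minute // 5, minute % 5)

# "It is always 5:55 somewhere": one 5-hour lamp, all eleven 5-minute lamps
print(berlin_clock_rows(5, 55))  # → (1, 0, 11, 0)
```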
When you START searching for edges, start with the YAR line. That’s what I did. But it goes on and on. Waypoint selectors…
The clues are in the K4 block of text; the actual answer is another layer using the rugged edges of all the panels. Oh, and you need a timing program.
If he responds I might write it up. But basically: use compass order for the edge stream in layer 1; use K0 (find the working sequence) for a digital interpretation (a timing mask, different from the E’s I used for clues) in layer 2; then start anchoring and rotating (use OOO for the anchor after the NE block, one very small ordering move), and find your finisher cipher with the clues.
I operate under the assumption that open source projects are compromised by states. If you espouse unpopular ideas, or are yourself a state, don’t rely on it.
Let's pretend what you are saying is true, which it is not. Who would you want to access your data? The state, or the "underworld"? Many countries have laws on how to access your data. With the underworld, you may wake up dead.
Granted, there are countries that act like a criminal org., but if you live there you have more issues than your data.
With proprietary software, there is a much larger chance that backdoors exist than in open source. Many of us heard of one case where it was claimed a project had a government-sponsored backdoor in it; a long audit found that claim was false.
Eventually, open-source backdoors will be found, because the systems are open. With proprietary software you are SOL unless you do very expensive and very hard testing, and even then it is doubtful you will find a backdoor.
It is true. Denying trivial truths with the purpose of not giving an inch does not add to one's argument, it weakens it.
Plenty of closed-source vendors will happily backdoor their products on request, without a warrant, if they are confident they will never be found out. That's the point. Not that FOSS is somehow inviolable to nation-states with virtually infinite resources, many of which sponsor or contribute to the financing of a huge percentage of FOSS development themselves.
It's easier to find backdoors in FOSS if you're looking, because you're allowed to look. But somebody has to be looking.
See Bob Brier’s “The Great Courses” lecture series on ancient Egypt: Nubians were painted dark, and Libyans were always shown with a feather in their headgear and blue eyes.
I never said that everyone in the Ancient Near East or the Mediterranean basin had a Sub-Saharan look, only that there were enough such people to be notable and that they were genuinely an integral part of those ancient societies, with quite high-status or even elite roles at times.
Chinese mythology says they came from 崑崙 (Kunlun Mountain), the description of which, coincidentally, sounds like Egypt.
Translated something like: “To the south of the Western Sea, along the banks of the Flowing Sands, beyond the Red Water and before the Black Water, there lies a great mountain called the Kunlun Hill.”
In my experience in the US, nurses at primary care practices don’t really care and have no passion for their profession, and the younger doctors are Anki-flashcard veterans who could be replaced with an LLM with probably the same outcome. Even DOs act like MDs these days; I guess it is easier to just write a prescription than to advise on traditional diets.
120/80 is an ideal blood pressure based on studies that show an association between elevated readings and an increased risk of heart attacks and strokes. More than half of humanity has a higher blood pressure than that. I believe most would have much lower readings if they stopped eating the trash food that capitalism has produced.
I jog, cycle, do other light sports, my work involves walking a fair bit, and I eat well - zero trash - and even with meds I'm still way above 120/80.
Heredity seems to play a part (my guess).