What is the breakthrough? There is no mention of the problem solved, nor the accuracy of the quantum solution, nor the amount of classical computing resources needed for a non-quantum solution.
The engineering probably has some cool novelty: 39 qubits and 10 million layers make for a very large circuit. But simulating a large circuit is very different from achieving a quantum computing breakthrough in CFD.
I do get where you are coming from. Indeed, it makes little sense to use Julia for lots of machine learning when PyTorch and Jax are just so good. And it sounds like you don't want to use Julia, so who am I to try and convince you? Python/R are capable languages.
But, there are still reasons I reach for Julia.
Interesting packages where I prefer Julia over Python/R: Turing.jl for Bayesian statistics; Agents.jl for agent-based modelling; DifferentialEquations.jl for ODE solving.
I would much rather data-munge tabular data in Julia (DataFrames.jl) than Python, though R is admittedly quite nice on this front.
Personally, I reach for Julia when I want to use one of the packages above, or when I'm coding something up from scratch, where I find base Julia far preferable to numpy.
Janet seems really tempting for its tiny footprint, distributability, etc.
But I'm currently leaning towards Racket just because it would be more or less compatible with a whole host of Scheme books that I'd like to read (The Little Schemer/Typer/Learner, SICP, Functional Differential Geometry).
Does anyone familiar with Janet know if those books can be easily worked through with Janet for a newbie Lisper?
Racket is a great choice for learning but also very batteries included if you want to make real projects in it.
Janet has a more Clojure-inspired syntax, but the semantics and general ideas should carry over. I think trying to work through the books in Janet would be a great extra challenge. You can always drop that and focus on Racket if it becomes too much for you.
Why not try doing the exercises in Janet? It gives you the end-product goal, so you don't have to waste energy on ideation, while having to figure out the syntactic differences and compatible standard library functions and macros yourself really helps you understand a language top to bottom; it's arguably the best way to learn. That's how I learn languages all the time: translating exercises from sites like HackerRank or Exercism, along with toy projects I already have.
I haven't used Janet, but there were some notes in the author's ebook (https://janet.guide/) suggesting it has some small but key differences from a more "traditional" Lisp that might trip you up, or at least cause friction. I'd suggest using Racket for those books.
With a little bit of effort you can make it work. But I believe Racket has specific language definitions for some of these books (there's a #lang sicp, for instance), so you can follow them more or less seamlessly.
Two big hurdles for ML: a) explainability b) accountability.
ML-enhanced development neatly circumvents this: explainability and accountability are passed on to the developer. This includes bugs and license infringements.
I have no qualms about ML tools in development. But so long as the buck stops with me, I prefer to write from scratch.
Traditional chess engines excel at tactics. But most chess tactics are brute-forceable, and traditional chess engines are normally brute-force behemoths.
Large language models have traditionally been very weak at brute-force computation (just look at how bad they are at multiplication of large numbers). If one can somehow excel at something that typically takes raw computing power, like chess tactics deep into a game, well after a novel position has been reached, then put me down as impressed.
I'm impressed too, but only in the sense that I'm surprised that transformers can reach any level of success at all.
But it seems mostly like a weird statistical quirk of chess that tactical sequences are somehow the most common move sequences (openings aside), and so they're very likely to pop out of a model predicting the most likely tokens. But there doesn't seem to be a reasonable path from doing that to being able to determine whether a likely tactical sequence is actually a good idea. I've been thinking a lot about ways to use a transformer model in combination with search to accomplish this for a few days now. So far I have a few ideas; most of them are a lot less revolutionary than I was hoping for.
You could take a (narrowly trained) transformer model, give it the game so far, and have it predict a sequence of moves. Then use those moves as a move-ordering heuristic in a good old Stockfish-like architecture: do the normal alpha-beta search, but look at the suggested moves first at each node, indexed by ply. I could imagine that if the suggestions get really good, this might prune a decent number of nodes. But nothing earth-shattering; my intuition says maybe 50 Elo of gains at most. I have other, non-transformer ideas that I think are more worthwhile to work on for now.
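To make the move-ordering idea concrete, here's a minimal sketch using the python-chess library. `suggested_line` stands in for the hypothetical transformer output (a list of moves indexed by ply from the root), and the evaluation is just a crude material count; a real engine would obviously do far more:

import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board):
    # Crude material count from the side to move's perspective.
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, board.turn))
        score -= value * len(board.pieces(piece_type, not board.turn))
    return score

def order_moves(board, suggested_line, ply):
    # Try the transformer's suggestion for this ply first, if it's legal.
    hint = suggested_line[ply] if ply < len(suggested_line) else None
    return sorted(board.legal_moves, key=lambda m: m != hint)

def alphabeta(board, depth, alpha, beta, suggested_line, ply=0):
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    for move in order_moves(board, suggested_line, ply):
        board.push(move)
        score = -alphabeta(board, depth - 1, -beta, -alpha,
                           suggested_line, ply + 1)
        board.pop()
        if score >= beta:
            return beta  # cutoff; better ordering makes these happen sooner
        alpha = max(alpha, score)
    return alpha

The point being: the model only reorders the moves, the search still makes every decision, which is why I don't expect more than a modest Elo gain from it.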
The other idea is to instead invoke the model at every node, asking for one move only, and guide a search that way. But this somehow just feels like a reinvention of LCZero.
There's a third more speculative idea which would involve a novel minimax search algorithm inspired by how humans think in terms of tactical sequences, but this idea is still so vague in my head I'm not even sure how to coherently describe it, let alone implement it, or whether it makes sense at all.
I still need to think about this more deeply, break out my whiteboard, and play with some minimax trees to flesh it out. It is intriguing, though. I'd also have to train my own transformer; I see no reason to actually end up using GPT for this. There seems to be no sense in including the entire internet in your training data if all you're doing is predicting sequences of algebraic notation.
If you just want to learn a model of chess games in algebraic notation (is that what it's called? I don't play chess) then you don't need to train a Transformer. That would be overkill, and you wouldn't really be able to train it very well. I mean, unless you have a few petaflops of compute lying around.
You could instead start with a smaller model. A traditional model, like an n-gram model, a Hidden Markov Model (HMM), or a Probabilistic Context-Free Grammar (PCFG). The advantage of such smaller models is that they don't need billions of parameters to get good results, so you'll get more bang for your buck from the many, many examples of games you can find.
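For instance, a bigram model over moves is nothing but counting which move tends to follow which. A toy sketch (the two-game "corpus" here is made up, just to show the mechanics):

from collections import Counter, defaultdict

def train_bigrams(games):
    # Count how often each move follows each other move, across all games.
    counts = defaultdict(Counter)
    for game in games:
        for prev, nxt in zip(game, game[1:]):
            counts[prev][nxt] += 1
    return counts

def predict(counts, prev_move):
    # Most frequent continuation seen after prev_move, or None if unseen.
    followers = counts.get(prev_move)
    return followers.most_common(1)[0][0] if followers else None

games = [["e4", "e5", "Nf3", "Nc6"],   # made-up toy corpus
         ["e4", "c5", "Nf3", "d6"]]
model = train_bigrams(games)
print(predict(model, "Nf3"))  # -> "Nc6" (ties broken by insertion order)

Note that it conditions on the previous move alone, not on the board, which already hints at why you shouldn't expect it to get very far.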
But don't expect to get very far. A system that learns only to predict the best move will never beat a system that looks ahead a few dozen ply with alpha-beta minimax, or that plays out entire game trees, like Monte Carlo Tree Search. Well, unless you do something silly to severely hobble the search-based system, or train the predictive model on all possible chess games. Which, I mean, is theoretically possible: you just need to build out the entire chess game tree :P
You could also try a simpler game: Tic-Tac-Toe should be amenable to a predictive modelling approach. So should simpler checkerboard games like hexapawn. Or even checkers, which is, after all, solved.
But my question is, what would you hope to achieve with all this? What is the point of training a predictive model to play chess? Hasn't this been tried before, and shown to be no good compared to a search-based approach? If not, I'd be very surprised to find that out, and there might be some merit in trying to test the limits of the predictive approach. But it's going to be limited alright.
You're probably right. I'm still learning about neural networks and how transformers work, slowly going through Karpathy's youtube videos while building my own dumb little things.
Could you elaborate a bit more on why you think training a transformer only on chess moves (in algebraic notation, yes. Algebraic notation is the one that says <piece><square>, roughly speaking) wouldn't work? I'm not sure I understand.
As for your question, I don't really have a good answer. I've just been working on my own crazy chess AI ideas for a long while now, and I was taken aback by the fact that GPT seems able to occasionally "find" long tactical sequences even in positions that have not occurred before in known games. So it seemed only natural to think deeply about whether this represents some nugget of something useful, maybe even a fundamentally new approach. But I have serious doubts, as I explained in the GP.
It's also just been an interesting angle for me to understand what LLMs are doing, because I'm deeply familiar with chess and with methods of thinking about it, both human and artificial. There's a lot more for me to grab onto than with any other application in demystifying its behaviour.
>> Could you elaborate a bit more on why you think training a transformer only on chess moves (in algebraic notation, yes. Algebraic notation is the one that says <piece><square>, roughly speaking) wouldn't work? I'm not sure I understand.
Oh no, I think it would work. Just that it would be impossible for one person to train a Transformer to play good chess just by predicting the next move. Now that I think about it, ChatGPT's model is trained not only on algebraic notation (thanks!) but also on analyses of games, so the natural language in its initial prompt also directs it to play a certain ... kind? style? of game. I'm guessing anyway.
>> I've just been working on my own crazy chess AI ideas for a long while now, and I was taken aback by the fact that GPT seems able to occasionally "find" long tactical sequences even in positions that have not occurred before in known games.
Well, what GPT is doing is, fundamentally, compression. Normally we think of compression as what happens when we zip a file, right? You zip a file, then you unzip it, and you get the same file back. Setting aside lossless versus lossy compression for a second, it is also possible to compress information so that you can uncompress it into variations of the original.
Here's a very simple example: suppose I decided to store a parse of the sentence "the cat eats a bat" as a context-free grammar.
sentence --> noun_phrase, verb_phrase.
noun_phrase --> det, noun.
verb_phrase --> verb, noun_phrase.
det --> [the].
det --> [a].
noun --> [cat].
noun --> [bat].
verb --> [eats].
Now that is a grammar that accepts, and generates, not only the initial sentence, "the cat eats a bat", but also the sentences: "the cat eats a cat", "the cat eats the cat", "a cat eats the cat", "the bat eats a cat", "the bat eats a bat", "a cat eats a cat", "a bat eats a bat" and so on.
So we started with a grammar that represents one string, and we ended up with a grammar that can spit out a whole bunch of strings that are not the original string. That's what I mean by "compress[ing] information so that you can uncompress it into variations of the original". And that's why these models can generate never-before-seen sequences, like you say: they generate them from bits and pieces of sequences they've already seen.
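If you want to see the "uncompressing into variations" concretely, here's the same grammar transcribed into a few lines of Python with a random generator (my transcription for illustration, obviously nothing to do with how a language model is actually implemented):

import random

GRAMMAR = {
    "sentence":    [["noun_phrase", "verb_phrase"]],
    "noun_phrase": [["det", "noun"]],
    "verb_phrase": [["verb", "noun_phrase"]],
    "det":         [["the"], ["a"]],
    "noun":        [["cat"], ["bat"]],
    "verb":        [["eats"]],
}

def generate(symbol="sentence"):
    # Terminals are symbols with no rule: emit the word itself.
    if symbol not in GRAMMAR:
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])
    return [word for part in expansion for word in generate(part)]

for _ in range(3):
    print(" ".join(generate()))  # e.g. "a bat eats the cat"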
Obviously language models are very different models of language from grammars, and they also have weights that can be used to prefer certain generations over others, but that's more work for you.
Again, all that's nothing to do with Transformers. It's just a way to understand how you can start with some encoding of one sentence, and generate many more. Fundamentally, language modelling works the same regardless of the specific model.
Edit: note also that the grammar above isn't compressing the original sentence "the cat eats a bat" at a very high rate, but if you take into account all the other sentences it can generate, that's a good rate of compression.
There is much more to generative models than building out language models and image models.
Generative models are about characterising probability distributions. If you ever predict more than just the average of something using data, then you are doing generative modelling.
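A toy illustration of the distinction (assuming, purely for the sketch, that a Gaussian fits the data): a predictive model stops at the conditional mean, while a generative model characterises the whole distribution and can be sampled from.

import random
import statistics

data = [2.1, 1.9, 2.4, 2.0, 1.6, 2.3]  # made-up observations

mean = statistics.mean(data)  # the predictive answer: one number
sd = statistics.stdev(data)   # plus spread: now we have a distribution

print(f"point prediction: {mean:.2f}")
draws = [random.gauss(mean, sd) for _ in range(5)]
print("draws from the fitted Gaussian:", [round(x, 2) for x in draws])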
The difference between generative modelling and predictive modelling is similar to the difference between stochastic modelling and deterministic modelling in the traditional applied mathematical sciences. Both have their place. Neither is overrated.