Hacker News
Human-like neural network chess engine trained on lichess games (maiachess.com)
158 points by unlog on Jan 17, 2021 | 68 comments


This bot is a pure joy to play against! For a 1560-rated bot:

    1. It brings out the queen early to attack the f7 pawn when Black plays the Sicilian.
    2. So far it has gone King's Indian almost every time against d4 (e.g. the Catalan) and then failed to challenge the center, instead going for a kingside attack (exchanging bishop for knight to try to open the b-file).
    3. It sometimes drops its queen in complex exchanges and to hidden attacks.
    4. It sometimes drops pieces in complex exchanges (it can't count how many pieces are covering a square).
    5. It will exploit mistakes that *I can see*, unlike Stockfish level 7+, which will take you down an insanely convoluted path to get you into a bind that destroys your position (I've played GMs who can't, or don't, do this in blitz).
    6. Its attacks are shallow and lack depth, so they're easily defended.
    7. It sometimes moves a single piece too many times (e.g. queen/knight) when it should be advancing its positional game; basically it has very little positional play.
    8. It sometimes pushes pawns aggressively to its own detriment.
    9. Won't resign in losing positions (*lol*).
       a. Will play to the bitter end to try to get a stalemate.
A couple of things that could make it more human:

    1. It moves fast! Too fast, actually; there needs to be a better rating-dependent delay factor. It could think for a longer-than-usual amount of time after it finishes development, in complex positions, or when it's close to mate or about to lose an exchange, and it could speed up when it's getting low on time.
    2. Make it randomly rage-quit in a losing position like an asshole instead of resigning, so you have to wait for the quit/disconnect detection and then the claim-victory/draw countdown (I jest, I jest, but if we do this, please make it some sort of setting).


I wasn't able to find what time control the AI was trained on, but I'm a 1400 bullet player and at that level it is uncommon to resign even if you are down a minor piece and a pawn (or more, if you're in a good attacking position). The probability of being able to win on time or off a blunder is quite high.

I even saw an IM vs. NM bullet game the other day where the NM was in a losing position but stayed in to grab a stalemate: https://www.reddit.com/r/chess/comments/kwoikt/im_not_a_gm_l.... Not sure if Levy was being unsportsmanlike to stay in the game despite being in a losing position, but even at a high level I think it's normal to play to the end if your opponent is in time trouble.


They're trained on everything but Bullet/HyperBullet. The neural network just predicts moves and win probabilities so we don't have a way (yet) of making it concede.


Is there a way to treat resignation as a "move"? Even though the win probability is zero by definition, it still may be the most accurate move prediction in certain scenarios.


Yes, the output is just a large vector with each dimension mapping to a move. But we'd probably do it as a different "head": a small set of layers trained just to predict resignations. Both options would break the lc0 chess engine we use for things like the Lichess bots, though.
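
For illustration only (this is not the actual Maia/lc0 code; every module name and size below is an assumption): a resignation head could be a small extra set of layers hanging off the shared trunk, next to the usual move-prediction head.

    # Illustrative PyTorch sketch, not the actual Maia/lc0 code; all sizes and
    # module names are assumptions.
    import torch
    import torch.nn as nn

    class PolicyWithResignation(nn.Module):
        def __init__(self, trunk: nn.Module, trunk_features: int = 256, num_moves: int = 1858):
            super().__init__()
            self.trunk = trunk                                        # shared tower, assumed to output (batch, trunk_features)
            self.policy_head = nn.Linear(trunk_features, num_moves)   # one logit per encoded move
            self.resign_head = nn.Sequential(                         # small extra head trained only on resignations
                nn.Linear(trunk_features, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, board_planes: torch.Tensor):
            features = self.trunk(board_planes)
            move_logits = self.policy_head(features)    # distribution over moves (after softmax)
            resign_logit = self.resign_head(features)   # P(human resigns here), via sigmoid
            return move_logits, resign_logit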


It's never unsporting to play on in a bullet game since it's so short, unless it's a long drawn out stall that isn't making any progress. The winning player can quickly finish the game if it's a clear lost cause.


Haha


This kind of "human at a particular level" play is something I've personally wished for many times. I find playing against programs very frustrating, because as you tweak the controls they tend to go very quickly from obviously brain-damaged to utterly inscrutable. Win by a mile or lose by a mile, don't learn much either way. Sometimes there's a very thin band in between that's the worst of both worlds: generally way above my own level, but every once in a while they'll just throw away a piece in the most obvious possible way. If a human did that I'd interpret it as toying with me, or taunting.

This kind of program seems like it would be much more satisfying to play just for fun, and perhaps (with a bit more analysis support) better still as a coaching tool.


Try the CrazyBishop-based games aka Chess Lvl 100 / The Chess.


Very interesting research!

A particular use case implied by the features is the ability to analyze errors that you *would* make, as opposed to the exact errors that you *did* make. Since the personalized "Maia-transfer" model seems to be able to predict the specific blunders that the targeted player is likely to make, those scenarios can be automatically generated (by having Maia play against Stockfish many times, as sketched below) and presented as personalized training exercises to improve your specific weak spots.
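
A rough sketch of how such exercises could be harvested, assuming a personalized Maia network and Stockfish are both available as UCI engines. The engine commands, the single-node Maia search, the depth-12 evaluation, and the 200-centipawn threshold are all illustrative assumptions, not anything from the paper:

    # Rough sketch (not from the paper): play a personalized Maia model against
    # Stockfish over UCI and keep positions where Maia's move loses a lot of eval.
    import chess
    import chess.engine

    def collect_blunder_positions(maia_cmd, stockfish_cmd, n_games=10, drop_cp=200):
        exercises = []
        with chess.engine.SimpleEngine.popen_uci(maia_cmd) as maia, \
             chess.engine.SimpleEngine.popen_uci(stockfish_cmd) as sf:
            for g in range(n_games):
                board = chess.Board()
                maia_color = chess.WHITE if g % 2 == 0 else chess.BLACK
                while not board.is_game_over():
                    if board.turn == maia_color:
                        before = sf.analyse(board, chess.engine.Limit(depth=12))["score"].pov(maia_color)
                        fen = board.fen()                  # position before the candidate blunder
                        board.push(maia.play(board, chess.engine.Limit(nodes=1)).move)
                        after = sf.analyse(board, chess.engine.Limit(depth=12))["score"].pov(maia_color)
                        if before.score(mate_score=10000) - after.score(mate_score=10000) >= drop_cp:
                            exercises.append(fen)          # a mistake "you" would likely make
                    else:
                        board.push(sf.play(board, chess.engine.Limit(time=0.1)).move)
        return exercises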


Yes, that's exactly one of our goals. In the paper we even have a section on predicting which boards lead to mistakes (in general). The results were much weaker than the move prediction, but we're still working on it and will hopefully publish a followup paper.


Overall the games were enjoyable; however, this game stood out as an issue with the engine. A long gif, but notice the moves at the very end where it had three queens and refused to checkmate me. https://lichess1.org/game/export/gif/M0pJAiyL.gif

38. Kxa5 Nxg2 39. Kb6 f5 40. h4 f4 41. h5 gxh5 42. Kc7 f3 43. Kd6 f2 44. Ke7 f1=Q 45. Ke8 Qe1+ 46. Kd7 h4 47. Kd6 h3 48. Kd5 h2 49. Kd4 h1=Q 50. Kd5 h5 51. Kd6 h4 52. Kd7 h3 53. Kd8 h2 54. Kd7 Qhg1 55. Kd6 h1=Q 56. Kd5 Ne3+ 57. Kd6 Nf5+ 58. Kd7 Ng7 59. Kc7 Qd1 60. Kc8 Qc1+ 61. Kd7 Qgd1+ 62. Ke7 Qhe1+ 63. Kf6 Nh5+ 64. Kg6 Nf4+ 65. Kf5 Nh3 66. Kf6 Nf2 67. Kg6 Kf8 68. Kf6 Ke8 69. Kg6 Kd8 70. Kg7 Kc8 71. Kg8 Kb8 72. Kg7 Ka8 73. Kg8 Ka7 74. Kg7 Ka6 75. Kg8 Ka5 76. Kg7 Ka4 77. Kg6 Kb3 78. Kg7 Ka2 79. Kg8 Ka1 80. Kg7 Ka2 81. Kg6 Ka1 82. Kg7 Ka2 { The game is a draw. } 1/2-1/2

These were bullet games where it was rated 1700 and I am rated 1300ish... however, I won a number of games against it, and I never felt like I didn't have a chance.


> A long gif, but notice the moves at the very end where it had three queens and refused to checkmate me.

I guess that part of the position space was undersampled in the training data!


I think this is very interesting. One comment I have heard about Leelachess is that she, near the beginning of her training, would make the kinds of mistakes a 1500 player makes, then play like a 1900 player or so, before finally playing like a slightly passive and very strategic super Grandmaster.

One interesting thing to see would be how low-rated humans make different mistakes than Leela does with an early training set. How closely are we modeling how humans learn to play Chess with Leela?

Another thought: against weaker computers, Leela draws a lot more than Stockfish. While Leela beats Stockfish in head-to-head competitions, in round robins Stockfish wins against weaker programs more often than Leela does.

I believe this is because Stockfish will play very aggressively to try to create a weakness in a game against a lower-rated computer, while Leela will "see" that trying to create that weakness would weaken her own position. The trick to winning at chess is not to make the "perfect" move for a given position, but to play the move that is most likely to make one's opponent make a mistake and weaken their position.

Now, if Maia were trained against Stockfish moves instead of human moves, I wonder if we could make a training set that results in play a little less passive than Leela’s play.

(I’m also curious how Maia at various rating levels would defend as Black against the Compromised defense of the Evans Gambit — that’s 1. e4 e5 2. Nf3 Nc6 3. Bc4 Bc5 4. b4 Bxb4 5. c3 Ba5 6. d4 exd4 7. O-O dxc3 — where Black has three pawns and White has a very strong, probably winning, attack. It’s a weak opening for Black, who shouldn’t be so greedy, but I’m studying right now how it’s played, to see how White wins with a strong attack on Black’s king. I’m currently downloading maia1 — Maia at 1100 — games from openingtree.com.)


I only found one game where Maia1 (i.e. Maia at 1100 ELO) lost playing the black pieces with the Evans Gambit Compromised defense:

1. e4 e5 2. Nf3 Nc6 3. Bc4 Bc5 4. b4 Bxb4 5. c3 Ba5 6. Ba3 d6 7. d4 exd4 8. O-O dxc3 9. Qd3 Nf6 10. Nxc3 O-O 11. Rad1 Bg4 12. h3 Bxf3 13. Qxf3 Ne5 14. Qe2 Bxc3 15. Bc1 Nxc4 16. Qxc4 Be5 17. f4 d5 18. exd5 Bd6 19. f5 Re8 20. Bg5 h6 21. Bh4 g5 22. fxg6 fxg6 23. Rxf6 g5 24. Rg6+ Kh7 25. Qd3 gxh4 26. Rxd6+ Kg7 27. Qg6+ Kf8 28. Rf1+ Ke7 29. Rf7# 1-0


Actually, Stockfish crushed Leela in a recent TCEC. It seems that Stockfish's new neural network had a huge effect on performance, something like a 130 Elo improvement.


I cannot find a recent tournament where Stockfish has crushed Leela in head-to-head play.

This is their most recent ongoing head-to-head: https://www.chess.com/events/2021-tcec-20-superfinal

Current result: 9 draws, one win with Stockfish as White, and one win with Leela as White. Drawn.

There is also this one from a couple of years ago: https://www.chess.com/news/view/computer-chess-championship-... “Lc0 defeated Stockfish in their head-to-head match, four wins to three”. Stockfish did get more wins against the other computers, so it won the round robin, but in head-to-head games Leela was ahead of Stockfish.


https://en.wikipedia.org/wiki/TCEC_Season_19

Stockfish finished +9 over 100 games.

I don't have a stake in either of those two engines, so I don't care which one is better than the other. In the end, as a poor chess player, it won't change anything for me :) It is interesting, though, to compare how those two pieces of software are evolving and how they got here.

Stockfish is much older, and it took a lot of hand-tuning to reach its current level. It is (or was) full of carefully tested heuristics that give a direction to the computation. It would be very difficult to build an engine like Stockfish in a short span of time.

Leela got there very, very quickly. Even if it was not able to win in October, the fact that it got competitive and forced the field to adopt drastic changes in such a short period of time is impressive. It seems to be a good example of how sometimes not using the "best" solution can still be a win: getting good results after a few months against something that required 10 years of work.


Thank you for getting back to me with a source. I agree Stockfish had a significant edge over Leela in that contest from a year ago.

Right now, Stockfish is winning in the current TCEC, but only by one point (one more win than Leela): https://www.chess.com/events/2021-tcec-20-superfinal (Stockfish 12: 27.5, Leela: 26.5).


It's worth noting that this approach, of training a neural net on human games of a certain level and not doing tree search, has been around for a few years in the Go program Crazy Stone (https://www.remi-coulom.fr/CrazyStone/). (There wasn't a paper written about it so it's not common knowledge, and I assume the authors of this paper weren't aware of it or they would have cited it.)


Hmm... maia1100 is ranked ~1600. https://lichess.org/@/maia1


I think the developers explained the reason for this in a Reddit thread: collectively, a bunch of 1100 players are stronger than 1100. Imagine that an 1100 player plays one bad move for every two decent moves. However, different players miss different moves, so the most-picked move in each position will usually be a decent move.


This makes perfect sense - but is a bit problematic given the intended goal of the project. They've built a bot that plays like an 'averaged group' of humans, not a human. Maybe they need a better way of sampling from the outputs.


This is always the problem with training from historical data only: you’ll become very good at being just as good as the sample group.

Ideally you'd use the sample only as the basis, and then let the engine play against itself for training and/or participate in real-world games, as was done with AlphaGo and AlphaStar.


Seems analogous to the average faces photography project where the composite faces of a large number of men or women end up being more attractive than you'd imagine for an average person.


This is only true if you select the most likely move instead of sampling from the probability distribution over possible moves. In the latter case there is no reason for there to be a wisdom-of-the-crowd effect.

If you sample from the probability distribution you are modeling, there is no reason it shouldn't play like an 1100 player.
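
A toy illustration of the argmax-versus-sampling distinction (the moves and probabilities below are made up):

    # Toy illustration (made-up moves and probabilities): argmax vs sampling.
    import numpy as np

    rng = np.random.default_rng(0)
    moves = ["Qxd5", "Nf3", "h3"]                 # hypothetical legal moves
    probs = np.array([0.90, 0.07, 0.03])          # model: 90% of 1100s take the queen

    crowd_move = moves[int(np.argmax(probs))]            # "group vote": always Qxd5
    human_move = moves[rng.choice(len(probs), p=probs)]  # one sampled player: sometimes misses it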


Well one reason is that it still won't be a 'single' player - what they've done here is like having a group of thousands of 1100 players vote on a move. What you're suggesting is to then pick a random player each move and go with them. There's no consistency, maybe on move 10 the player is blind to an attacking idea, but then on move 11 suddenly finds it...


I think they are saying, if your neural network was probabilistic and you thought there was a 90% chance of someone doing move A, but a 10% chance of move B, then you shouldn’t always get move A if it was human like - you would sometimes get move B.

I.e. most of the time if you leave your queen hanging and under threat your opponent will take it, but sometimes they just don’t see it. That’s the difference between playing a bot and a human a lot of the time - humans can get away with a serious blunder more often at low level play.


That's exactly what I'm saying - except more like the model is saying there's a 90% chance that a randomly chosen player at this level would make the move.


Yes, if you don't condition on the past moves then the distribution you're modeling is where you randomly pick a 1100 player to choose each move as you say. What I'm saying is that there will be no wisdom of the crowd effect.


> maybe on move 10 the player is blind to an attacking idea, but then on move 11 suddenly finds it...

What part of that is unrealistic? This happens constantly to human players, and not only at 1100...


Unintentional ensemble learning :)


Maybe. I think if you had asked me to predict the rating, I would have guessed below 1100, though, because it only predicts moves in isolation; I would expect the moves not to form a coherent whole working together in a good way.

So it's an interesting result to me.


That's what a few machine learning people I talked to thought would happen. There are lots of examples in machine learning (self-driving cars being the big one) where training on individual examples isn't enough. We were actually hoping for our models to be as strong as the humans they're trained on, so we are underperforming our target in that way.

I'm the main developer for Maia chess


Have you thought about trying a GAN or an actor-critic approach? Did you find it infeasible?


> Because of only predicting moves in isolation.

This actually may be the reason for the higher rating. What are the odds that a low-rated player will blunder a piece in a particular position? Quite low. But what are the odds that a low-rated player will blunder a piece somewhere in a game? Quite high. So while this engine may predict the most likely move, it can't fake a likely game, because it is too consistent.

I think GANs could help with something like this: one NN tries to make a human-like move, and another one tries to guess whether the move was made by a human or by the engine, given the history of moves in the game.


I suppose play at 1100 is really bad, such that it's mainly about avoiding obvious blunders rather than having a sound long-term strategy.


The real reason is that 1100 players are ranked ~1600 on Lichess. There's a good site that compares FIDE ratings, Lichess ratings, and Chess.com ratings. Lichess is inflated by several hundred points on the low end; Chess.com is more accurate. They converge towards the upper end of the human rating range.


I think the reason is that if you pick the most likely move of an 1100 player on every move, you end up with a 1600 player.

Let's say you accidentally leave a pawn hanging: 90% of 1100 players would spot it, and 10% of the time they miss it. In this engine, since taking the pawn is the most likely move, it spots it 100% of the time.

So this example shows that if you pick the most likely move of an 1100 player on every move, you end up scoring better than an 1100 player. Of course it works the other way too, on spotting brilliant moves that others miss, but I guess at the 1100 level there are more opportunities to mess up!
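
A toy calculation of why this compounds over a whole game (the per-position miss rate and the number of critical positions are made-up numbers):

    # Made-up numbers for illustration: a per-position miss rate compounds over a game.
    p_miss = 0.10      # assumed chance a real 1100 misses a given hanging piece
    chances = 5        # assumed number of such critical moments per game

    p_clean_human = (1 - p_miss) ** chances
    print(f"real 1100 gets through all of them: {p_clean_human:.0%}")   # ~59%
    print("argmax engine gets through all of them: 100%")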


The training data is also from lichess so I don't think that is it.


From the README on Github:

> Note also, the models are also stronger than the rating they are trained on since they make the average move of a player at that rating.

Reference: https://github.com/CSSLab/maia-chess


This probably breaks lichess cheat detection.


Which is unfortunate, but at least the players who play this bot hopefully have a more enjoyable game than the ones who play a depth-limited stockfish, for example.


Think about this more broadly: detecting deepfakes and generating them is just adversarial training that will make deepfakes even better, and then our society won't trust any video or audio without cryptographically signed watermarks.


We don't have anything like that with photos, and things have turned out OK.


It's too early to say that. The big problems are coming as deepfaking gets cheaper and easier. Scammers are using deepfake photos to aid in their scams. It's going to get worse, especially anyplace that photos are used as proof or evidence.


It's a little different with videos and audio, though. People place much more trust in them, for now at least, and people do generally still trust photos more than text, so there is a wider challenge as we become more able to suborn higher levels of truthiness for propaganda/memes.


People have always found reasons to distrust things that they don't like.


This is very cool. I think this could be extended to create a program that finds the best practical moves for a given level of play. Instead of just predicting the most likely human move, it could suggest the current move with the best "expected value" based on likely future human moves from both sides.

As an example, let's say there's a position where the best technical move leads to a tiny edge with perfect play. Current programs like Stockfish and AlphaZero will recommend that move. It would be better to instead recommend the move with a strong attack that leads to a large advantage 95% of the time, even if it leads to no advantage with perfect play. It seems one could extend Maia Chess to develop such a program.
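
Something like the following one-ply sketch, where `maia_policy` (a predicted human reply distribution) and `stockfish_eval` (a win probability for the side to move) are hypothetical helpers, not existing APIs:

    # One-ply sketch of a "best practical move" search over likely human replies.
    # `maia_policy` and `stockfish_eval` are hypothetical helper functions.
    import chess

    def best_practical_move(board: chess.Board, maia_policy, stockfish_eval) -> chess.Move:
        best_move, best_ev = None, -1.0
        for move in list(board.legal_moves):
            board.push(move)
            if board.is_game_over():
                ev = 1.0 if board.is_checkmate() else 0.5
            else:
                ev = 0.0
                for reply, prob in maia_policy(board).items():  # likely *human* replies, not the engine-best one
                    board.push(reply)
                    ev += prob * stockfish_eval(board)          # it's our turn again here
                    board.pop()
            board.pop()
            if ev > best_ev:
                best_move, best_ev = move, ev
        return best_move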


As someone who is colorblind, the graphs are unfortunately impossible to follow. Perhaps you could use an additional method of distinguishing data on the graphs other than color? Like dashed lines, or lines with symbols on them, or varying thickness, etc.


Thanks for mentioning that. The graphs do use different dashes to distinguish the colour palettes, which are supposed to be colour-blind friendly, so I thought we'd be OK. I'll take a look at adding more non-colour distinguishing features.


As a long time chess player and moderately rated (2100) player, this is a fascinating development! I played the 1900 and beat it in a pretty interesting game! Can you increase the strength to 2100+?



Do read the paper. (Click "PAPER" in the top menu.) I found it very interesting.


Do they train separate NNs for each time control?

If not, I wonder if that would make accuracy even higher!


They filter out fast games (bullet and faster) and moves where a player has less than 30 seconds left to make the move. They say this is to avoid capturing "random moves" made in time trouble.


It's interesting, because IMHO the moves that humans make when in time trouble (which intuitively look decent but have unforeseen consequences) would be the exact thing that you would want to capture for a human-like opponent that makes human-like suboptimal moves.


We did not; we removed bullet games because they tend to be more random, and we also did some filtering of the other games to remove moves made with little time remaining, for the same reason. But the models don't know about different time controls right now.
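
For anyone curious what that kind of filtering might look like in practice, here is an illustrative sketch (not the actual Maia pipeline) using python-chess on a Lichess PGN export with [%clk] annotations; the 180-second and 30-second cutoffs are assumptions:

    # Illustrative sketch only, not the actual Maia training pipeline.
    import chess.pgn

    def filtered_moves(pgn_handle, min_base_seconds=180, min_clock=30.0):
        while (game := chess.pgn.read_game(pgn_handle)) is not None:
            tc = game.headers.get("TimeControl", "")
            try:
                base = int(tc.split("+")[0])
            except ValueError:
                continue                   # unknown/correspondence time control: skip
            if base < min_base_seconds:
                continue                   # bullet or faster: skip the whole game
            board = game.board()
            for node in game.mainline():
                clock = node.clock()       # seconds left for the player who just moved
                if clock is None or clock >= min_clock:
                    yield board.fen(), node.move   # keep (position, human move) as a training pair
                board.push(node.move)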


Grandmasters typically play a small number of (recorded) games per year. Would it be possible to train a neural network on lots of games and then have it recognize the likely moves of a particular player from a small number of their games, so that you could have a computer opponent to prepare you for a match with a grandmaster?


Lately, grandmasters play a huge number of recorded games per year online.


Yeah, these days you can tune into a Twitch stream and grab some top grandmaster games any day of the week.


Huh, even better, although I guess I'm behind the times.


I'm wondering something similar: maybe GMs could train against a neural network built from their upcoming opponent's historical games, and thus get more experience against that 'opponent' and learn their weaknesses without having to actually play against the opponent themselves.


I think there is an app that claims to let you play against Magnus Carlsen at different ages. I always assumed this is how they implemented that feature.


With transfer training on IM John Bartholomew's games, Maia predicts ...d5 with high accuracy. #scandi


Do you think one day we can have AI reverse hashes by being trained on tons of data points the other way?

My guess is no, because you have to get an exact output of a function which is not continuous at all. But maybe I am missing something?


I thought this said "lichen" and that it was some sort of crazy fungal network for a second, like the slime mold and the light experiment. Oh well...


Step 2 with AI: see if you can make it human.

It's great to see research on this.




