AIs calibrated to kyu/dan strength with easier to understand settings #44
Comments
@bale-go amazing work! One issue I have with formulas like this is deciding which parameters are exposed to users and how to expose or explain them to users -- or hide them altogether. |
Thank you for the suggestion about katrain-bots. I used PyAutoGUI to automate the games. More than 10 games were run for each bot. After iteratively finding the correct parameter (total number of moves seen by katago), the games were quite balanced, without any serious blunders. None of the bots had an extra advantage at the beginning, middle, or end of the game.
Black: pachi -t =5000:15000 (rank=3d)
Even games with different bots at different total number of moves seen by katago:
GnuGo 3.8 at level 8 (ca. 10 kyu): 24
Linear regression did not give a good fit in this wider rank range. Theoretically it makes more sense to use the log10 of the total number of moves seen. That way it is not possible to have a negative number of seen moves. The equation:
(total number of moves seen by katago) = int(round(10**(-0.05737*kyu + 1.9482)))
The equation works for ranks from 12 kyu to 3 dan, which covers more than 90% of active players. The equation has another nice feature: extrapolating the line gives ca. 10 dan for perfect play, where the total number of moves seen is the size of the go board (361).
I think it would be nice to have a simple view, where one could set the kyu level directly. |
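The rank equation quoted above can be written as a small helper. This is only a sketch: the function name `moves_seen_for_kyu` is made up for illustration, and encoding dan ranks as non-positive kyu values is an assumption, since the thread does not spell out the convention.

```python
def moves_seen_for_kyu(kyu: float) -> int:
    """Total number of moves seen by katago for a target rank.

    Uses the fitted equation from the thread. Dan ranks are assumed to be
    passed as non-positive kyu values (an assumption, not from the thread).
    """
    return int(round(10 ** (-0.05737 * kyu + 1.9482)))


# Weaker ranks see fewer moves, stronger ranks see more.
for kyu in (12, 10, 3):
    print(f"{kyu} kyu -> {moves_seen_for_kyu(kyu)} moves seen")
```

For 10 kyu this gives 24, which agrees with the GnuGo 3.8 calibration point mentioned in the comment.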
any way you could have a look at this for p:weighted and scoreloss? I think they're my prefered AIs and curious how they perform on blunders in early vs end game. |
The reason I did not use scoreloss is that it heavily depends on max_visits and is much slower. Theoretically, I find the approach of P:Pick better. The value of the NN_policy seems rather arbitrary in many cases. One can see this by comparing the values of the score_loss and the NN_policy for a given position. The absolute value of NN_policy does not directly reflect the score_loss. For example, NN_policy(best) = 0.71 and score_loss(best) = 0 points; NN_policy(second_best) = 0.21 and score_loss(second_best) = 1 point. However, I found that the order of the moves from the best to the worst is very similar for score_loss and NN_policy. P:Weighted relies on the absolute value of NN_policy. P:Pick relies on the order of the moves of NN_policy. |
The compute cost and visits conflation is definitely an issue. However, a major weakness of pick over weighted is being blind to 'all' good moves on a regular basis and playing some policy <<1% move, at which point the ordering is quite random. |
I guess what I am trying to argue here is that having a policy value less than 1% is not a problem per se. If you check amateur human vs. human games, there are multiple moves below 1% or even below 0.1%. The obvious blunders can be removed by using a shifting pick_override setting (80% initially to 50% in the endgame). In the end, the user experience is the most important. The runs with different bots show that the modified P:Pick policy makes a balanced opponent for a wide range of ranks. |
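The pick-with-override idea discussed in this exchange could look roughly like the sketch below. It is not KaTrain's actual code: `pick_move`, `n_seen` and `override` are hypothetical names, the policy values are invented, and sampling the seen moves uniformly is an assumption based on the remark that P:Pick relies only on the order of the policy values.

```python
import random


def pick_move(policy, n_seen, override, rng=random):
    """Pick a move from `policy`, a dict mapping move -> policy value in [0, 1].

    If the top policy move is at least `override`, play it directly
    (this is what filters out obvious blunders). Otherwise show the bot
    `n_seen` uniformly sampled moves and play the best one among them.
    """
    best_move = max(policy, key=policy.get)
    if policy[best_move] >= override:
        return best_move
    seen = rng.sample(list(policy), min(n_seen, len(policy)))
    return max(seen, key=policy.get)


# Illustrative position (invented numbers, not taken from a real game).
policy = {"A17": 0.50, "F19": 0.30, "E16": 0.15, "B18": 0.05}
print(pick_move(policy, n_seen=2, override=0.8, rng=random.Random(0)))
```

Lowering `override` over the course of the game, as suggested above, makes the direct top-move branch fire more often, while `n_seen` controls how strong the fallback choice is.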
<<1% is more like 0.1%, which more often is problematic (first line moves and such). |
This is my first time using GitHub (I only registered to participate in this fascinating project). |
refactored a bit after the merge and added tests since it was turning into quite the spaghetti. It went all the way to losing by 80 points to near jigo against p:weighted and looks nice -- what bounds do you think there are on the rank? |
The upper limit currently is the strength of the policy network, around 4d. |
|
Pretty cool! |
some real weird stuff in the weaker one though (e.g. 153/155) |
Isn't Katrain just trying to start a capturing race in the corner? |
I didn't think it would work out so well. All of the ranks are within one stone, except for katrain-6k, which was still 5k in the morning. I was thinking about using this method to assess the overall playing strength of a player. I saw something similar in GNU Backgammon. It is possible to estimate your skill by looking at your moves. Currently the analysis mode can help you discover certain very bad decisions, but I think it might also be important to see the consistency of all of your moves. I'm currently working on dividing the game into 50-move pieces and calculating the kyu rank for each part of the game (opening (0-50 moves), early middle game (50-100 moves), late middle game (100-150 moves), endgame (150-)) by the median rank of the moves (best move is 1st, second best is 2nd, etc.). |
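The segmentation described here might be sketched as follows. The function name and input format are assumptions made for illustration: `move_ranks[i]` is taken to be the rank of the i-th played move among the stronger engine's candidates (1 = best). Converting a median rank into a kyu estimate is left out, because the thread does not give that mapping.

```python
from statistics import median


def median_rank_by_segment(move_ranks, segment_length=50, n_segments=4):
    """Median move rank for each part of the game.

    Segments are opening, early middle game, late middle game and endgame;
    the last segment is open-ended (move 150 onwards with the defaults).
    Returns None for a segment that the game never reached.
    """
    result = []
    for s in range(n_segments):
        start = s * segment_length
        end = None if s == n_segments - 1 else start + segment_length
        chunk = move_ranks[start:end]
        result.append(median(chunk) if chunk else None)
    return result


# Invented example: strong opening, weaker middle game, decent endgame.
ranks = [1] * 50 + [3] * 50 + [5] * 50 + [2] * 30
print(median_rank_by_segment(ranks))
```

Using the median rather than the mean keeps a single blunder or lucky guess from dominating the estimate for a segment.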
@bale-go I went a bit lower, since especially at those ranks people seem to looooove bots. |
Move 153: B B18 AI thought process: Using policy based strategy, base top 5 moves are A17 (18.12%), F19 (13.48%), E16 (10.26%), A10 (8.22%), D18 (6.56%). Picked 8 random moves according to weights. Top 5 among these were B18 (2.42%), R7 (0.12%), S11 (0.01%), P15 (0.01%), T12 (0.00%) and picked top B18. Move 155: B C17 AI thought process: Using policy based strategy, base top 5 moves are F19 (24.98%), H18 (24.70%), E16 (15.19%), L6 (10.72%), B13 (6.61%). Picked 8 random moves according to weights. Top 5 among these were C17 (0.04%), Q15 (0.02%), S3 (0.01%), P2 (0.01%), G10 (0.01%) and picked top C17. didn't realize n=8 at this level, makes more sense now :) |
The success in covering a wide range of strengths with the policy pick method shows to me that it captures some important aspects of the difference between beginner and expert understanding of the game. In line with the p-pick-rank method, it is not far-fetched to assert - according to the bot calibration and ogs data - that a 3k player chooses the best move from ~60 possible moves (M). In other words, 3k players will find the 5th best move on average (on median ;) ) during their games. But we can reverse the question. If we observe through the analysis of a much stronger player (katago) that the median rank of moves is 5, we can argue that the player is ca. 3 kyu. As I mentioned earlier, we can use this method to evaluate parts of the game. I wrote a script to calculate the ranks by this method. Here are two examples to showcase it. It seems that GnuGo developers did a terrific job with the opening (hardcoding josekis etc.) and the endgame, but the middle game needs some improvement. pachi -t =5000 --nodcnn (3 kyu): Pachi was ahead in the first 100 moves of the game with katrain3k, but it made a bad move, and MCTS bots are known for playing weirdly when losing. The changing ranks show this shift. Please let me know if you are interested in a PR. |
18k seems suspect, no? that's a huge rank difference. Then again, pachi doing well...is it just biased towards MCTS 'style'? |
Indeed, 18k is a huge difference. In the long run, maybe it would be better to color code the parts of the game, similarly to the point loss. The calculated rank for the total game would be the reference. If a part of the game is much worse (worse than -5k) it would be purple; -2k to -5k red; -2k to +2k green (neutral); better than +2k blue (congrats!). However, this scale would be independent of the score loss of individual moves. It would assess the overall quality of that part of the game. Due to the application of median the calculated ranks are resistant to outliers (blunders, lucky guesses etc.). Indeed, it could show that player A was better than player B in the quality of play, but player A made a blunder and lost the game. |
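A minimal sketch of the proposed color coding, with the whole-game rank as the reference. The function name is hypothetical, and the handling of the exact boundaries (a difference of exactly 2 or 5 stones) is an assumption, since the comment only gives approximate bands.

```python
def segment_color(segment_kyu, game_kyu):
    """Color for one part of the game relative to the whole-game rank.

    Ranks are kyu numbers, so a smaller number means stronger play.
    `diff` is positive when the segment was played better than the game
    as a whole, negative when it was played worse.
    """
    diff = game_kyu - segment_kyu
    if diff > 2:
        return "blue"    # better than +2k
    if diff >= -2:
        return "green"   # within 2 stones of the reference: neutral
    if diff >= -5:
        return "red"     # 2 to 5 stones worse
    return "purple"      # more than 5 stones worse


# Invented example: an 8k game with one segment played at 18k level.
print(segment_color(segment_kyu=18, game_kyu=8))
```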
What do you think of a short text-based report at the end of a game to start with? It could go into sgfs and even be sent in chat on ogs |
I think that would be awesome. I analyzed two recent games on ogs.
katrain-10k(B) won the second game in a very close match (B+1.5). It played at ca. 7k level during the game.
|
It's strange that the bots don't play at their level -- are you sure you're not off by some factor due to it being 'the best among n moves' and not 'this rank'? |
I think it is due to the underlying randomness of the p-pick-rank method.
|
@SimonLewis7407 Thank you for the kind words! |
Found a mistake in the mean/median curves (copy-pasted var), updated ones below. even more so, there is a difference in the top policy value for human and bot moves: i.e. the bots/humans leave the game in a significantly different state for their opponent. |
I think the 'move picking' effectively does this, it has some expected value (which is in the thread) and deviation (which we don't know) |
Thank you bale-go. Yeah, I shouldn't have said gaussian, what I should have said was some curve or formula (or even a brute-force list of approximations) that matches those blue distribution charts (the six that correspond to humans) shown by sanderland a little higher up on this page. If those distributions represent human play, and the bots can approximate those distributions in their own play, that is great! I will check them out on OGS like you suggested, or maybe clone v1.2 like you mentioned. P.S. That said, still there is some unavoidable change that needs to be made to a bot's playing when you reach the endgame, right? I don't know how to define "endgame" but up until then, the average difference in quality between best policy move and second-best policy move [or among the top n moves for small values of n] is very small, but then it swells very large for the endgame. |
@sanderland In move rank vs. kyu plots you need to use the same x-axis. |
aaahhh! of course, fixed :) |
We decrease override, so it more readily plays the top move when there are fewer available moves |
Thanks sanderland! I am not very sophisticated about these things, but I see that formula for decreasing override will decrease it gently, over a large number of moves. That definitely points things in the right direction. But my suspicion is that in real go play there is more of a "quantum jump" that might need to be reflected. Like, when the bot recognizes some combination of conditions (fewer available moves, for sure, but maybe some other factors too), then it needs to make an extra, further decrease of the override for the remainder of the game. |
@SimonLewis7407 Decreasing the override is not the only strategy p-pick-rank utilizes to improve the endgame. |
@sanderland The new score loss histograms are amazing! Updated calculated kyu vs. ogs kyu with 2182 games for users. |
I spent some time looking at |
With the charts just above, what is the difference between the two charts on the left and the two on the right? The legend doesn't say, sorry it's not obvious to me. |
Oh crap, they do say! It's mean and median, sorry, I missed that. |
Again, it is pretty nice how the plots for users and bots line up. It seems that point loss estimation with 1 visit is not perfect for 15b. It is not terribly bad though. A 1 point miscalculation is pretty tolerable in most <3d games. I wanted to check if move ranks are more robust to the change of models. They certainly are. I think in the next version we could use the move rank vs. # of legal moves curves of users to create bots that mimic human play even better (e.g. we could let katago see slightly more moves in the opening and slightly fewer in the middle game). However, I would not change the kyu rank estimation script. I think it is really important for the user to know objectively which part of the game they excel at. BTW, @sanderland do you plan to introduce the kyu rank estimation in the form of a short text message in the current branch? |
@bale-go I think it's better to polish it a bit more and put it into 1.3 |
Thanks for checking how robust move ranks are to model changes, I was afraid the 20b was getting overfitted / too narrow but it seems we don't have to worry. |
@sanderland Do you have the raw data for them? I'm looking to analyze them a bit and get some rough guidelines for users on how much they need to reduce various error types/sizes to reach the next rank level. This can provide some context for how bad mistakes of various sizes are. |
@killerducky see https://github.com/sanderland/katrain-bots/blob/master/analyze_games.ipynb and other notebooks in that repo. |
Current options are rather mathematical; calibrating some settings to a kyu/dan rank and adding a slider that sets them would improve usability.