Statistics from a 27M game sample

Hi!

I managed to download the full JSON dataset from @za3k OGS database dump…

…and to reduce it to a convenient size to calculate some stats on it.

I’ve been able to convert 27085332 records. I may have lost some, hopefully fewer than 2000.
There are 26935658 OGS games and 149674 uploaded SGFs.
I’m going to focus on OGS games only, since SGFs are less interesting to me and may contain weird data (for example, someone uploaded some SGFs with apparently huge boards: 100x100, 666x666, 1000x1000).

Here is my first bit: which boards were used on OGS?
I already did that analysis on a small sample
https://forums.online-go.com/t/can-we-get-an-sgf-database-dump/38837/46
but here is the full monty! :smiley:

Apparently there are still some unexplored places on this map!

Here is the result when we remove standard boards (9x9, 13x13 and 19x19):

Top choice is 5x5.

Board Games
19x19 12739207
9x9 10647310
13x13 3412104
5x5 41350
15x15 13957
7x7 11244
25x25 10161
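
For anyone who wants to reproduce a tally like this, here is a minimal sketch. It assumes one JSON object per line and that each record has `width` and `height` fields; both the file layout and the field names are assumptions for illustration, not something taken from the dump itself.

```python
import json
from collections import Counter

def count_board_sizes(lines):
    """Tally board sizes from an iterable of JSON strings, one game per line."""
    sizes = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        game = json.loads(line)
        # 'width' and 'height' are assumed field names.
        sizes[f"{game['width']}x{game['height']}"] += 1
    return sizes

# Tiny made-up sample standing in for the real file.
sample = [
    '{"width": 19, "height": 19}',
    '{"width": 9, "height": 9}',
    '{"width": 19, "height": 19}',
]
for size, games in count_board_sizes(sample).most_common():
    print(size, games)
```

Passing an open file object instead of `sample` would stream the data line by line, so the whole file never has to fit in memory.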

I’ll update this thread when I have something new to share.

27 Likes

In the other thread I shared various move frequency diagrams from the 100k-sample, here is one using all 10 837 323 non-handicap 19x19 games in the dataset:

image

Click on the image for an interactive diagram where you can move the slider yourself to see the frequency throughout moves 1-300 animated, it’s quite pretty to look at!

Edit: The site where I hosted the diagram updated and now the data is super slow to load for some reason, so the page will be blank for a minute before the diagram appears.

I also searched for some games with special properties, such as ones with many consecutive moves on the first line. I was hoping to find stuff like these, legitimate games that just happen to show a nice pattern:

16 consecutive moves on the first line

15 consecutive moves on the second line

But of course, these soon get beaten by garbage games like these:

24 consecutive moves on the first line

36 consecutive moves on the second line

(I stopped searching after I found these, likely there is even bigger garbage later in the collection, but I don’t find that very interesting)

If anyone has suggestions for other stats/special games that would be interesting and not too hard to check, let me know! My old computer takes about one hour to chew through the entire 82 GB JSON file, which is quicker than I was expecting.
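
The "consecutive moves on the first line" search can be sketched roughly like this. It assumes moves are 0-indexed `(x, y)` pairs and that a pass is stored as `(-1, -1)`; both are assumptions about the encoding, not a description of the actual script.

```python
def board_line(x, y, size=19):
    """Return 1 for the first line (the edge), 2 for the second line, and so on."""
    return min(x, y, size - 1 - x, size - 1 - y) + 1

def longest_run_on_line(moves, line, size=19):
    """Length of the longest streak of consecutive moves played on `line`."""
    best = current = 0
    for x, y in moves:
        on_board = 0 <= x < size and 0 <= y < size  # passes fall outside
        if on_board and board_line(x, y, size) == line:
            current += 1
        else:
            current = 0  # a pass or a move elsewhere breaks the streak
        best = max(best, current)
    return best

# Made-up move list: three first-line moves in a row, a pass, then one more.
moves = [(3, 3), (0, 5), (18, 2), (7, 0), (-1, -1), (4, 18)]
print(longest_run_on_line(moves, line=1))  # 3
```

Running this per game and keeping only the maximum keeps memory use flat, whatever the size of the collection.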

15 Likes

I’m amazed that 17x17 didn’t make the top 7. Considering it used to be the main board size, I would have thought it would still see some play…

4 Likes

Maybe the obvious things would be like histograms of

  • wins and losses by Black or White, possibly split by komi, handicap etc. (the summary data is probably already in one of anoek’s rating update threads, though)
  • wins by score and how much, or by resignations etc.
  • lengths of games, which would have various bumps at common time settings.

Various things were done in the past with random samples (edit: my mistake), I think, like rank histograms (see Unofficial OGS rank histogram 2021 - #23 by DVbS78rkR7NVe), but I suppose those work better as snapshots than with a whole list of games. Pulling out the ranks won’t be useful unless it’s to check whether the “stronger” player won, etc.

I guess one could try to look at the popularity of fuseki over time. That is, look for some named openings like the Sanrensei, low/high Chinese, Kobayashi, and make a histogram over time (start time of the game?) of how frequently these are played by Black or White etc. The openings are mainly named for Black though, I suppose, so just checking Black’s first three moves and maybe who won the game :stuck_out_tongue: (although winning would probably not be just because of the opening :slight_smile: )

If you can think of some AI fuseki as well one could check those over time, see if they appear much earlier than expected etc.

In theory you could try to look for all kinds of funny stuff, like of the games that went to scoring, how many groups were there on the board for each color, assuming you wrote a bit of code to count the number of groups, or what is probably the same, distinct areas belonging to each player in an area sense. If you had that bit of code, you could even count the number of small life groups appearing, worth X points in area scoring say for X<10 or 15 or something. EDIT: (although maybe that’d be a hassle, basically making a scoring tool :P)
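
The "bit of code to count the number of groups" could be a plain flood fill, sketched below. It counts connected chains of stones on a final position given as strings, which is a simplification: it does not remove dead stones or merge chains that are only loosely connected, so it is not a full scoring tool. The board encoding (`B`, `W`, `.`) is an assumption.

```python
def count_groups(board):
    """Count orthogonally connected chains of stones for each colour.

    `board` is a list of equal-length strings using 'B', 'W' and '.'.
    """
    height, width = len(board), len(board[0])
    seen = set()
    counts = {"B": 0, "W": 0}
    for r in range(height):
        for c in range(width):
            colour = board[r][c]
            if colour not in counts or (r, c) in seen:
                continue
            counts[colour] += 1  # found a chain we have not visited yet
            seen.add((r, c))
            stack = [(r, c)]
            while stack:  # flood fill over same-coloured neighbours
                cr, cc = stack.pop()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and (nr, nc) not in seen
                            and board[nr][nc] == colour):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
    return counts

board = [
    "BB.W.",
    ".B.WW",
    "B...B",
    ".WW..",
    "B.W.B",
]
print(count_groups(board))  # {'B': 5, 'W': 2}
```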

5 Likes

Something else: try graphing the log of the data, that’d be easier to read, in a way :slight_smile:

1 Like

I am still curious about the tengen frequency and occurrence throughout the game,

I am also curious about the relationship between shapes and move sequence, like what this post implied

Is it really true that lower-level players don’t actually play common shape moves like the tiger’s mouth, etc.? I know from teaching some young kids that they tend to like “straight” answers, like continuous extensions, or hane after hane, but at some point they start to learn to jump, then the knight’s move, large knight’s move, and two-space jumps even later, not to mention the bamboo joint and table shape. I wonder if this is actually how it goes as players advance in rank.

3 Likes

A bit about length of games.

Here are all completed games from years 2018-2021 (start date) divided by days of duration.
The big orange dot represents all the games that ended within 24 hours.
Every other dot is for other integer values (2 days, 3 days and so on…).

As a correspondence player I feel quite diminished. :smiley:

There’s one game that lasted 10 … negative days! :smiley:

Here is the longest game: 2060 days!

I have issues finding the shortest games, since there are plenty of games that lasted… zero seconds!
I’m talking about completed, scored games. I fear there must be some kind of bug somewhere.
Here are a few examples:

6 Likes

Since you’re having fun, you might as well give a complete answer to this one.


That one isn’t a random sample, it’s pretty complete. There were some games I couldn’t get my grabby hands onto but the next iteration will be even better.

4 Likes

Aha, that’s pretty cool, and probably makes it even cooler. I kind of expected it to be a lot of work to try and grab all the accounts, so some sampling might have been done, but I stand corrected!

Game lengths from 2 632 713 non-handicap 19x19 games that had a score result:

image

The average is 253.5 moves.

Games longer than 500 moves were ignored and not included in the average.

For this one I decided to calculate game length as (length of movelist) - 2, to remove the final two passes.

I failed to take into account that under AGA rules, white must pass last - I believe this is the reason for the spikes on even game lengths.

If I ran the analysis again, this time removing all trailing passes, we would get a smoother histogram and a sliiiightly lower average.
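
Removing all trailing passes instead of a fixed two could look something like this. It assumes moves are `(x, y)` tuples and that a pass is stored as `(-1, -1)`, which is an assumption about the encoding.

```python
PASS = (-1, -1)  # assumed encoding of a pass

def length_without_trailing_passes(moves):
    """Number of moves left after dropping every pass at the end of the game."""
    n = len(moves)
    while n > 0 and moves[n - 1] == PASS:
        n -= 1
    return n

# Ordinary ending: two passes.
print(length_without_trailing_passes([(3, 3), (15, 15), PASS, PASS]))  # 2
# White passes first, Black passes, White must pass again: three passes.
print(length_without_trailing_passes([(3, 3), PASS, PASS, PASS]))      # 1
```

For the second game, `len(moves) - 2` would give 2 instead of 1, which is the kind of off-by-one that would produce spikes at particular parities. Passes in the middle of a game are deliberately kept, since only the trailing ones are stripped.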

6 Likes

Did you consider filtering out unranked games?

As your examples show, there may be many more outliers (longer or shorter) among unranked games.

As a reminder, the original question is

27M games should include enough non-standard boards, but the data needs filtering. So it’s annoying.

1 Like

That would maybe be a good idea (although there would still be some “garbage” in ranked games too).

Ah, right! I was mostly interested in this question for high-level (or at least “reasonable”) games. I wonder what would be the best way to explore this with just the OGS dataset.

One could calculate game length divided by area for each game and make a histogram like above. I guess this would give a normal-ish distribution but it wouldn’t really verify or falsify the hypothesis. So better would be to plot average length against area - but we have so much more data on 9x9, 13x13 and 19x19 than the other sizes, so different datapoints would have very different confidence. But I think this plot could be interesting, I’ll probably try that at some point.
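
A minimal sketch of the "average length against area" idea, keeping the number of games behind each point so that the very different sample sizes stay visible. The input format, `(width, height, number_of_moves)` tuples, is an assumption for illustration.

```python
from collections import defaultdict

def mean_length_by_area(games):
    """Map board area -> (mean game length, number of games).

    `games` is an iterable of (width, height, number_of_moves) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # area -> [sum of lengths, game count]
    for width, height, n_moves in games:
        entry = totals[width * height]
        entry[0] += n_moves
        entry[1] += 1
    return {area: (total / count, count)
            for area, (total, count) in sorted(totals.items())}

# Made-up numbers, only to show the shape of the output.
games = [(9, 9, 50), (9, 9, 60), (19, 19, 250)]
for area, (mean, count) in mean_length_by_area(games).items():
    print(area, mean, count)
```

The count could drive the marker size or an error bar in the plot, so that the sparse non-standard sizes are not read with the same confidence as 9x9, 13x13 and 19x19.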

1 Like

I got curious after the last post how many games use AGA rules. Not that many, it turns out:

6 Likes

Wow, does literally no one use Ing? The distinction is probably not so relevant, since I guess OGS must not actually implement the Ing ko rules, since I doubt that anyone can implement the Ing ko rules.

This is only necessary when using territory counting with AGA rules, and it does not need to be implemented as an actual additional pass, just as one additional pass stone handed over (and done only once, after all resumptions and life/death disputes have been resolved).

I’m not sure the OGS AGA rules actually do it that way; I think OGS just uses area counting for AGA rules to get the equivalent score without having to account for pass stones.

1 Like

This was the data I plotted:

image

Seems like Google Sheets just left Ing out from the pie chart since the slice would be < 0.1% :stuck_out_tongue:

Before pasting into Sheets, I filtered out even smaller entries to get rid of some garbage from uploaded SGFs. Here is what the full output looked like:

{'aga': 180273, 'japanese': 19531122, 'nz': 37072, 'chinese': 7120701, 'korean': 201751, 'ogs': 266, 'finn': 1, 'ing': 14192, 'aga (area)': 194, 'simple': 69, '1': 46, 'ing rules': 1, 'old chinese': 1, 'jp': 427, 'jpn': 11, 'relay game': 2, 'uchikomi': 96, 'w wins jigo': 34, 'w gives komi': 7, 'white gives ': 1, 'taiwanese (a': 1, 'free placeme': 1, 'free handica': 1, 'aga (territo': 28, '日本': 9, 'aga (fläche': 1, 'японск': 1, 'подсче': 2, 'ja': 1, '': 16, 'aga (地)': 2, '地を数え': 22, 'tang (japane': 1, 'japanische r': 4, 'china': 1, 'stone': 7, 'american': 1, 'kosimplescor': 5, 'japonesas': 2, 'japenese': 1, 'zh': 45, 'mine': 1, 'cn': 8, "ikeda's area": 1, 'egf': 1}
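
The filtering step could be sketched like this, using a few of the counts from the output above. The cut-off of 1000 games is an assumption; the threshold actually used isn’t stated.

```python
def filter_and_share(counts, min_count=1000):
    """Drop rare entries and return each survivor's share of all games in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total
            for name, n in counts.items()
            if n >= min_count}  # min_count is an assumed cut-off

# A subset of the counts listed above.
rules = {
    'aga': 180273, 'japanese': 19531122, 'nz': 37072, 'chinese': 7120701,
    'korean': 201751, 'ing': 14192, 'ogs': 266, 'jp': 427,
}
for name, share in sorted(filter_and_share(rules).items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {share:.3f}%")
```

With these numbers Ing survives the cut-off but comes out well below 0.1% of the games, which fits the pie chart silently dropping its slice.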

I also think so, but there is a “white_must_pass_last” property in the json which seems to be true whenever the rules are AGA. So maybe an extra pass is just automatically added to the end of the move list in those cases?

I don’t really care enough to dive in and figure out the details, I’ll leave that to someone else :stuck_out_tongue:

3 Likes

I wonder what % Japanese rules would have if it wasn’t set as default on OGS.

UPDATE: much better version: Statistics from a 27M game sample - #29 by stone_defender


top 10 moves: Statistics from a 27M game sample - #94 by stone_defender


repainted your diagram
odd number moves are black after all
and replaced the least popular moves with board background

first 20 moves:

14 Likes

At least as far as I can see on the website, games with AGA rules do not show white necessarily passing last or any pass stones. Example:

At least it seems to get the handicap adjustment right!

1 Like