Statistics from a 27M game sample

Sorry, I don’t understand.

on moves 1,2 ?

Move number 1 (black), move number 2 (white).
1-1 is a coordinate.


yes

Does it mean that in the database 6250 games started with a 1-1?

In a large enough data set, there are bound to be some trolls and misclicks

I think it would make sense to filter out some of the very rare moves to reduce such noise


Timesujis?
For white’s move 2 on 1-1, there is also forgotten handicap.
(For black’s move 1, we could say forgotten handicap plus wrong color.)
And lastly, we have complete beginners too.

Top 10 (image):

Top 10 coordinates are in the center for more than 100 moves, but then:

(Images: top 10 coordinates at moves 199, 200, 201 and 202.)

On all later moves, the top 10 coordinates are all on the 1st line.

…it ends so fast and abruptly.


Maybe we need more than just relative frequency with different color, but relative click amount with different circle size, so it would be a better indication of the actual distribution (maybe a log-scale factor for relative size, so they won’t disappear with less than 1 pixel)
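The log-scaled circle size suggested here could be sketched as follows. This is only an illustration: the function name `circle_radius`, the pixel limits and the sample counts are made up, not taken from the charts in this thread.

```python
import math

def circle_radius(clicks, max_clicks, max_radius=12.0, min_radius=1.0):
    """Map a click count to a circle radius (in pixels) on a log scale."""
    if clicks <= 0:
        return 0.0  # nothing to draw for points that were never played
    if max_clicks <= 1:
        return max_radius
    # log1p keeps a count of 1 above zero; the ratio lies in (0, 1].
    scale = math.log1p(clicks) / math.log1p(max_clicks)
    # Even the rarest played point gets at least min_radius pixels.
    return min_radius + (max_radius - min_radius) * scale

# Hypothetical counts: a rare point, a common one, and the most popular one.
radii = [circle_radius(c, 100_000) for c in (1, 500, 100_000)]
```

With a linear scale the rare point would shrink to a tiny fraction of a pixel; the log scale plus a minimum radius keeps it visible.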


Exactly.
Titles aren’t very good for picking teaching games.

Of course, there could be more if we searched in other languages (French, German, Japanese and so on). But choosing a title is completely free and depends on the player’s taste and creativity.

I also forgot to mention that some games have “teach” in their title just because one player has “teach” in his username and the game title is like vs :grin:

I could investigate further in remaining reverse komi games: removing all those with “teach” in the title and looking at the rest.

Considering all you’ve said, I don’t think this is representative of teaching games, unfortunately.

Why did you choose “teach” instead of “teaching game”? I don’t think it would be representative either, but it would have made more sense, no?

We’d need a language AI to work out what counts as a teaching game.


I agree.

Just because titles like “Please teach me” or “will teach new and high kyu” actually seem tied to teaching games.
That also brought in some false positives, but I can’t say which are more common… big data is full of noise.

I just realised that I forgot to filter out annulled games from that sample! :grin:
There are 3555 games out of 15874 which should be removed:

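| Outcome | Board | Ranked: False | Ranked: True | Total |
|---|---|---|---|---|
| Abandonment | 19x19 | 2 | | 2 |
| Cancellation | 9x9 | 126 | 80 | 206 |
| Cancellation | 19x19 | 1036 | 153 | 1189 |
| Disconnection | 9x9 | 8 | 25 | 33 |
| Disconnection | 19x19 | 48 | 6 | 54 |
| Moderator Decision | 9x9 | 2 | | 2 |
| Moderator Decision | 19x19 | 5 | | 5 |
| Timeout | 9x9 | 313 | 214 | 527 |
| Timeout | 19x19 | 938 | 599 | 1537 |
| Total | | 2478 | 1077 | 3555 |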

I didn’t mention resigned games (there’s 7701 of them) since I suppose that a teaching game could legitimately end by resignation.

Back to reverse komi:

In the chart that you quoted there were 70897 games. Only 147 of them have “teach” in their title.
Here is a breakdown of their outcomes:

| Outcome | Games |
|---|---|
| Cancellation | 3594 |
| Disconnection | 440 |
| Resignation | 29389 |
| Score | 16090 |
| Timeout | 21342 |
| Total | 70855 |

Here are the most frequent game titles:

| Game Name | Games |
|---|---|
| Friendly Match | 14603 |
| 친선 대국 | 2631 |
| 友谊赛 | 2099 |
| Online Lesson | 1589 |
| 親睦戦 | 1392 |
| 友谊对局 | 978 |
| Let’s play go! | 851 |
| Дружеский матч | 783 |
| 友誼對局 | 646 |
| Freundschaftsspiel | 596 |
| Partie amicale | 391 |
| Partida amistosa | 327 |
| go | 191 |
| i get black, you get 50 komi | 151 |
| Vriendschappelijke Wedstrijd | 125 |
| Challenge from onigaijin4649 | 101 |
| Challenge from lavb | 100 |

Looks like “Friendly game” in many languages.


That’s simply the default title (which changes with your language setting; just my two cents’ guess).


Yes! I just deleted all moves with a below-average number of clicks from the simulation.
Move 186: (image)
Now these shapes are clear and make some sense.
Only after that do the 1st line moves begin.
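A minimal sketch of that kind of filter, assuming "average" means the mean click count over all points of the board for a given move (the post does not spell this out). The function name and the tiny demo array are hypothetical.

```python
import numpy as np

def drop_below_average(clicks):
    """Zero out every point whose click count is below the board mean."""
    clicks = np.asarray(clicks, dtype=float)
    # Keep points at or above the mean, replace the rest with 0.
    return np.where(clicks >= clicks.mean(), clicks, 0.0)

# Made-up 3x3 "board": a few popular points and some stray clicks.
demo = np.array([[1, 2, 90],
                 [3, 4, 80],
                 [0, 5, 70]])
filtered = drop_below_average(demo)
```

Because the distribution is so skewed, a plain mean already removes most of the misclick and troll noise while leaving the popular points untouched.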


(Images: moves 2, 4, 6, 12, 24, 50, 100, 200, 250 and 300.)


https://forums.online-go.com/t/weak-score-estimator-and-japanese-rules/41041/70
Here I showed a Simple Score Estimator that works like this: it just floods in all 4 directions symmetrically.

But I got a new idea:
what if, instead of flooding symmetrically, the coordinates with the biggest number of clicks in the real data are painted?
Each iteration uses the data (averaged over all 8 symmetries) from the next move:
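One possible reading of this idea, as a rough sketch. The post does not say how a painted point gets its colour, so the rule used here (an empty point takes the colour of its neighbours when exactly one colour touches it) is an assumption, as are the function name, the `top_n` cut-off and the toy data.

```python
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def paint_by_popularity(board, clicks_per_move, top_n=10):
    """Paint empty points, most-clicked first, with an adjacent colour.

    board: list of lists holding 'B', 'W' or None (empty).
    clicks_per_move: one {(row, col): clicks} dict per iteration.
    """
    size = len(board)
    owner = [row[:] for row in board]  # copy so the input is untouched
    for clicks in clicks_per_move:
        # Visit the most popular coordinates of this move first.
        ranked = sorted(clicks, key=clicks.get, reverse=True)[:top_n]
        for r, c in ranked:
            if owner[r][c] is not None:
                continue  # a stone, or already painted earlier
            colours = {
                owner[r + dr][c + dc]
                for dr, dc in NEIGHBOURS
                if 0 <= r + dr < size and 0 <= c + dc < size
            } - {None}
            if len(colours) == 1:  # assumed rule: skip contested points
                owner[r][c] = colours.pop()
    return owner

# Toy 3x3 position with made-up click data for two iterations.
board = [['B', None, 'W'],
         [None, None, None],
         [None, None, None]]
clicks_per_move = [{(0, 1): 9, (1, 0): 5, (1, 2): 4},
                   {(1, 1): 3, (2, 0): 2}]
painted = paint_by_popularity(board, clicks_per_move)
```

Unlike the symmetric flood, the order in which points are painted now follows where people actually play at that stage of the game.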



I summed the number of clicks for each of the 300 moves, for both black and white,
then averaged over all 8 symmetries,
then replaced the most popular move with 55, the least popular move with 1, …
(there are only 55 really different moves on a 19x19 board)
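The averaging and ranking steps can be sketched with NumPy like this. It is not the code used for the table in this thread; the function names are hypothetical and the 19x19 input is random placeholder data standing in for the summed click counts.

```python
import numpy as np

def symmetrize(counts):
    """Average a square array over the 8 symmetries of the board."""
    counts = np.asarray(counts, dtype=np.int64)
    variants = []
    for k in range(4):
        rotated = np.rot90(counts, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # rotation followed by a mirror
    # Integer sum first, so symmetric points get bit-identical averages.
    return sum(variants) / 8.0

def dense_rank(values):
    """Rank distinct values 1..n, with 1 for the smallest value."""
    _, inverse = np.unique(values, return_inverse=True)
    return inverse.reshape(values.shape) + 1

# Placeholder data in place of the real click totals.
counts = np.random.default_rng(0).integers(0, 1000, size=(19, 19))
ranks = dense_rank(symmetrize(counts))
```

On a 19x19 board the 8 symmetries leave one triangle of 10 + 9 + … + 1 = 55 independent points, which is why the ranks never go above 55.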

move rank res

1 2 3 8 10 9 6 4 5 7 5 4 6 9 10 8 3 2 1
2 11 12 17 19 18 16 14 13 15 13 14 16 18 19 17 12 11 2
3 12 49 53 47 54 40 46 45 50 45 46 40 54 47 53 49 12 3
8 17 53 55 44 51 41 43 42 52 42 43 41 51 44 55 53 17 8
10 19 47 44 20 39 31 35 25 36 25 35 31 39 20 44 47 19 10
9 18 54 51 39 37 34 30 27 38 27 30 34 37 39 51 54 18 9
6 16 40 41 31 34 32 28 24 33 24 28 32 34 31 41 40 16 6
4 14 46 43 35 30 28 26 23 29 23 26 28 30 35 43 46 14 4
5 13 45 42 25 27 24 23 21 22 21 23 24 27 25 42 45 13 5
7 15 50 52 36 38 33 29 22 48 22 29 33 38 36 52 50 15 7
5 13 45 42 25 27 24 23 21 22 21 23 24 27 25 42 45 13 5
4 14 46 43 35 30 28 26 23 29 23 26 28 30 35 43 46 14 4
6 16 40 41 31 34 32 28 24 33 24 28 32 34 31 41 40 16 6
9 18 54 51 39 37 34 30 27 38 27 30 34 37 39 51 54 18 9
10 19 47 44 20 39 31 35 25 36 25 35 31 39 20 44 47 19 10
8 17 53 55 44 51 41 43 42 52 42 43 41 51 44 55 53 17 8
3 12 49 53 47 54 40 46 45 50 45 46 40 54 47 53 49 12 3
2 11 12 17 19 18 16 14 13 15 13 14 16 18 19 17 12 11 2
1 2 3 8 10 9 6 4 5 7 5 4 6 9 10 8 3 2 1

Now the question is: why is the 5-5 point so unpopular? Only 1st and 2nd line moves have fewer clicks. Moves around tengen have more clicks.
It is neither center nor side; it is in the corner, close to the most popular point.


difference between moves:
(the least popular marked as 1, point at the right is corner)

(Images: one chart each for the moves inside the 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th and 9th lines.)


difference between lines:
the moves inside each line are averaged, so that whole lines are compared;
the most popular line is marked as 10

(Image: 1st line at the top, tengen at the bottom.)


Naturally, if something is close to the most popular point, chances are the most popular point is already occupied. Playing 5-5 when the 4-4 is already occupied does nothing, most of the time.


I’ve been thinking about game time settings (byo-yomi and Fischer, really) and wondering which is popular and what actual times are typically or often used.
I suppose there will be peaks at the automatch time settings (I suppose I should know what these are…) but I’d like to know what settings are actually popular (or not) compared to the discussions there have been in the forums from time to time.

I guess this data dump has this kind of info but I also suppose it’s not completely straightforward to pull out and present things in a meaningful way with all the various combinations of main time periods, increments etc. Even just looking at Fischer and byo-yomi might be too much but I figured there’s no harm in asking!


Indeed the data is easy to extract, the question is just how you want it presented :slight_smile:

Here is an example of what info is stored and how:

"time_control": {
        "time_control": "byoyomi",
        "period_time": 30,
        "main_time": 600,
        "periods": 5,
        "system": "byoyomi",
        "speed": "live"
    },
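As a rough sketch of how such records could be tallied: this assumes each game is a dict with a `time_control` entry shaped like the excerpt above, and it only looks at byo-yomi games because those are the only fields shown here. The function name and the sample list are hypothetical.

```python
from collections import Counter

def tally_byoyomi(games):
    """Count (speed, main_time, period_time, periods) for byo-yomi games."""
    counts = Counter()
    for game in games:
        tc = game.get("time_control") or {}
        if tc.get("system") != "byoyomi":
            continue  # other systems use fields not shown in the excerpt
        key = (tc.get("speed"), tc.get("main_time"),
               tc.get("period_time"), tc.get("periods"))
        counts[key] += 1
    return counts

# Two copies of the record from the excerpt plus one non-byo-yomi game.
record = {"time_control": {"time_control": "byoyomi", "period_time": 30,
                           "main_time": 600, "periods": 5,
                           "system": "byoyomi", "speed": "live"}}
sample_games = [record, record, {"time_control": {"system": "fischer"}}]
tally = tally_byoyomi(sample_games)
most_common = tally.most_common(3)
```

`Counter.most_common` then gives the front runners directly, which is most of what a popularity chart needs.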

I’ll take a look at it in a day or two unless someone beats me to it (re-downloading the torrent now because I switched computers since I analysed this data last time).


That’s amazing!
I was thinking about what the most popular time settings are and if there is a clear favourite or maybe a couple of front runners for each time setting. And if the favourites for byo-yomi and fischer were equivalent in terms of game duration/pacing.

So I suppose something that shows the distributions of main time and increment/period duration for each of blitz, live and correspondence and each of byo-yomi and fischer. Initially I feel the number of periods of byo-yomi could be ignored, which might cause issues but I suppose it will generally be 3 or 5 and maybe that’s something to explore later.

So does it make sense to have a series of graphs with time across the bottom against number of occurrences of that time on the Y axis?

I suppose there will be peaks for each automatch timing, plus others for say ladder games and site tournaments or particularly popular tournaments. But does anything else rise above background noise?

I guess games which finished by cancellation or before a few moves were played should be excluded as I’m only interested in what’s popular and actually played not what’s popular to put out there but which hardly ever really happens in practice (I’m thinking about 1s blitz maybe seeming more popular than it should be otherwise!)
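The filtering described here might look like the sketch below. The `outcome` and `moves` field names, the "Cancellation" label and the 10-move threshold are assumptions for illustration; only the `time_control` fields come from the example record earlier in the thread.

```python
from collections import Counter, defaultdict

def main_time_histograms(games, min_moves=10):
    """Histogram of main_time per (speed, system), skipping non-games."""
    hist = defaultdict(Counter)
    for game in games:
        if game.get("outcome") == "Cancellation":
            continue  # offered but never really played
        if len(game.get("moves", [])) < min_moves:
            continue  # ended before a few moves were played
        tc = game.get("time_control") or {}
        hist[(tc.get("speed"), tc.get("system"))][tc.get("main_time")] += 1
    return hist

# Made-up games: one real live game, one cancelled, one too short.
live_tc = {"system": "byoyomi", "speed": "live", "main_time": 600}
games = [
    {"outcome": "Resignation", "moves": list(range(120)), "time_control": live_tc},
    {"outcome": "Cancellation", "moves": [], "time_control": live_tc},
    {"outcome": "Timeout", "moves": [1, 2, 3], "time_control": live_tc},
]
hist = main_time_histograms(games)
```

Each `(speed, system)` entry is then one of the graphs suggested above: main time on the X axis, number of games on the Y axis.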

(Separately, what is the actual ratio of blitz/live/correspondence? I always think of OGS as a correspondence server but I guess a lot of live games can be played while only one correspondence one is finished! Or in other words, is Lys’s observation that 25m/27m games lasted less than 24 an indication of the dominance of live over correspondence or that most correspondence games don’t get going properly?)

I shared something about that before:

I’m happy that Anton showed this:

I didn’t recall that bit.
Unfortunately I discarded it together with the rest of that section.
I wonder if I could extract that piece and merge it back into my subset of data.
