OGS Pattern Search Database

Also, about the data structures, from Making a hash of it • Life In 19x19 we have

As a reference point, on my machine Kombilo seems to use about 2G on a 110,000-game database that occupies about 15MB

So 40m/110k * 2GB = 726 GB

Beefy

Also links to Search Algorithms at Sensei's Library

2 Likes

That’s really cool! Thanks for the tip, somehow I never heard about it until now (though I am also a relatively new kid on the block, only learning Go existed and joining OGS in May 2017)

But Kombilo (a magnificent beast) allows you to make searches of rectangles of any size within a cache built (as I infer) from 19x19 hashing.

Ohh it’s saving everything. Now I see what you’re saying.

When I read OP, I envisioned an index of whole board positions (which I think can be optimized pretty well). Pattern matching local positions is indeed a problem that shouldn’t be solved!

1 Like

Fwiw, waltheri also does nxn rectangle local pattern matching :stuck_out_tongue:

So you envision looking up how many games opened with q16 as opposed to 4-4, is that right? There would be no matching for rotation or mirrors?

I think good-enough matching can be done pretty easily. Most games lose symmetry by like move 3, so just assume all games start in the proper octant and go from there.

Edit: I should clarify, I’m not sure how useful this is. Just what I thought the original ask was.

1 Like

I think if the intention is making OGS specific joseki, it would be quite critical to prioritise local position over global, since apart from ladder variations I think most kyu players give little consideration to the global position when selecting which local joseki to play…

2 Likes

Can’t we make it easier by splitting the records into individual ranks and process them separately? And would be interesting to know which josekis become more common and change from rank to rank.

Sure, that would be possible. But every search key you add for fast and efficient searching, requires adding some sort of index to your database, increasing the size of the database and requiring some amount of compute to populate the index.
So there is always a price to be payed upfront for things like this and database designers/administrators would weight that price against the projected benefits.

2 Likes

In the massive bulk download mentioned above by AdamR, I filtered all games where at least one player’s username contains “berdud”.

I see 481 games: 223 as black, 253 as white and five games “Uberdude vs Uberdude”!
There are also 7 uploaded SGF.

If these are the games you’re looking for, they’re available both as JSON file and as SGF collection following the links. You have to discard the rest of the 27M games, though. :slight_smile:

3 Likes

May I ask as a beginner (possibly outsider), how do you guys “filtered” games? Just asking one programming terminology at a time.

2 Likes

What I did was:

  • download the full dataset from za3k page on Internet Archive (the full JSON file, not SGF)
  • purge all informations that wasn’t interesting for me, in order to reduce the dataset size (as an example, I discarded all moves)
  • save remaining informations in a new smaller database

Now I can easily filter games using one of those remaining informations (board size, dates, ranked/unranked, player’s names, outcome, score, handicap, ruleset and so on).
By “filtering” I mean to summarize just those games which meet my conditions. In this case: at least one player’s name contains the string “berdud”.
This is commonly called “query” when talking about databases. There are many languages and tools to do that.

2 Likes

Well, I get the general idea, although with more terminologies (I think they are?) that confused me (like JSON, purge, reduce? what do you reduce a database? string? is it just a name? and the “query” obviously) I get their general meaning, but still have no idea how you do it? (just like in Go someone knows the general meaning of “territory”, doesn’t mean they can estimate, or play territory moves)

How did you guys “query” it exactly without taking forever? I sometimes saw people post codes in post like this

How do you make it works?

I actually know a little about query database using excel, but it’s just select some files, click some dialogs and select some range of columns. With so many games, I feel there must be a faster way, by the “language” and “tools” you are referring to I assume.

Since you know Excel, let’s use analogies with that.

Original dataset is something like a huge sheet.
Let’s say that each game is a row.
All informations for that game are cells of different columns.
To reduce sheet size I deleted many columns and then saved it into another file.

Size matters when doing queries.

Queries are similar to Excel filters.
In the column “Black username” you can select a single name and look only at his games, all other rows are temporarily hidden. In the “start date” column you can pick a year, in the “board size” column you can choose only 9x9 and 19x19.
This way you can see less rows.

Another thing that you can do with queries is to summarise, which is similar to count rows.
So, instead of having millions of rows for all single games, you can have only one for each board size, with the number of games played on that board.
I think this is called pivot in Excel. I’m not sure.

Excel also draws charts, which is what I usually do with data, but using another software called Tableau. It’s a program that can do queries, summarise and draw charts. This activity is called “business intelligence”.

The same can be done with different languages (programming languages specific for queries, such as SQL) and softwares (qlik view, power bi).

3 Likes

Thanks for all your time to answer my question, and even though I should be study my finals, but I am still very curious how it actually works.

One major question I have is that if each game is a row, how do you fit a two-dimensional go game records into each cell in the columns? I saw all the replies above talked about searching patterns, and I was still no close to understand about “every search key you add for fast and efficient searching, requires adding some sort of index to your database”, or “increasing the size of the database and requiring some amount of compute to populate the index” in @gennan’s reply to my post.

Why do we have to add and increase the size of the database (I assume index is the number of rows in your explanations?), and what does “populate the index” even mean? Wasn’t splitting the game records in different ranks will make them smaller?

Row is just a conceptual analogy, and does not necessarily restrict the form of the data within each cell of the rows. For example, in some databases, images or other arbitrary files could be associated with each row, and this might ultimately be accomplished with the cell just pointing to the location of where the file might be stored elsewhere.

In the case of game records, the data is typically stored as text via the SGF file format FF[4].

2 Likes

So how you filter or search the SGF files inside the cell? I don’t know how the analogy of Excel row and column would actually work?

You could parse out some key information from each SGF as meta-data that is stored in the other columns.

Each game record (SGF file) corresponds to a row, where one column could the raw contents of the SGF file itself, or maybe just the text of the move tree, while other columns could be each player’s name, their ranks, the result of the game, timing information, board size, etc.

Then, searching means using a program to select only the rows where certain columns match the specific criteria that you are looking for.

I’m not an expert with Excel, but I think it should be possible to take a large table made of many rows and select just the rows where the values in a particular column matches some criteria. For example, imagine that one column corresponds to white’s rank, and one wants to select all games where white is greater than 1 dan, then one could run a search that returns just the rows where that particular column indicates a rank larger 1 dan.

Excel is just an analogy here, but various database systems provide all sorts of ways to manipulate, query, filter, sort, and search through data. The key is that information in a database is carefully structured in a manner where it becomes straightforward to select based on certain values.

1 Like

Put aside the Excel analogy, what does this “text of the move tree” means? I now understand quite well how we can filter the names, size, date, etc. But moves are placed on a two-dimensional board, I still don’t understand how do we filter each move made by two players in each game, and link them together to different players in the database?

A very short SGF file might look something like this:

(;SZ[19];B[aa];W[bb];B[cc];W[dd];B[ad];W[bd])

Here, the “move tree” is just a single straight path with no variations. Each node is prefixed by a semi-colon, and everything is surround by parentheses, since that is part of the standard.

Here is what it means:
;SZ[19] start with an empty 19x19 board
B[aa] black plays a stone at the 1-1 point in the top-left corner
W[bb] white plays a stone at the 2-2 point in the top-left corner
B[cc] black plays at the 3-3 point
W[dd] white plays at the 4-4 point
B[ad] black plays at 1-4
W[bd] white plays at 2-4

This is a nonsensical game, of course, but just meant to illustrate how the game history is stored in a text format of an SGF file.

It is also possible to encode variations into an SGF file. This page gives some examples: SGF - Variations

2 Likes