Can we get an SGF database dump?

UPDATES

OK, the SGF conversion is complete. All games are available for download as SGF files or as JSON files. For now there are two downloads available: one organized by date and one by username. The SGF collection is about 11GB compressed.

Current hosting is at za3k - OGS Go game collection on my home server.

OK, I uploaded it to the Internet Archive as well: OGS 2021 collection of Go games (Zachary Vance / OGS).


Original Post

Can we have a public database dump of all the games up until 2021 (now) from the admins? I would be happy to host it indefinitely. I run my own webserver, can make a torrent, and can put it on archive.org (Internet Archive).

SGF format is probably useful for the most people. We would probably want to separate human-human, human-bot, and bot-bot games, but that could be done after the dump if needed.

Same request as in (but this link is dead)

and later in

Otherwise, I’m going to scrape game JSON + SGFs from the API, but that’s not really the best use of resources for either side. It looks like it would take two years with the rate limits in place, among other problems.

Related, documentation on throttling, i.e. “Please keep requests under X/second” would be good in the API docs.

14 Likes

Also definitely interested in this. Although OGS is not the only source of games, this would still very likely help with some research I’m interested in doing about using human amateur game data as training and testing material for experimental methods of improving KataGo’s analysis specifically to be more friendly to kyu players. (And as modern ML is exceedingly data-hungry, getting as many different data sources as possible in large bulk would be great).

5 Likes

I suppose @anoek is the one to ask here; tagging him in case he hasn’t seen this.

1 Like

We generate SGFs on request, so we don’t have a big database lying around to provide.

Grabbing those is currently throttled to 1/s.

3 Likes

In that case, would it be possible to provide, privately and as a one-shot event, a raw database dump of the fields needed to reconstruct the SGF files?

I think the main things needed other than the moves would be the recorded player ratings/ranks/usernames for those games and fields needed to construct the game outcome (e.g. win/loss by score/resign/time, or game unfinished, etc) and basic metadata like board size and time control and ruleset used. Any other fields could be omitted such as other internal server fields, and definitely ones related to user private info, etc. For my purposes I also don’t need game comments, reviews made of a game, or AI reviews/annotations either. (although if game comments exist in the normal SGF download, then we could have those for consistency).

I would be willing to do all the work of building the SGF files offline myself, then provide them to @za3k for public hosting. This could be a one-time event: a way to work through the backlog of existing games without it taking forever and without overtaxing the servers that the rate limits are there to protect. Thereafter, normal queries to the server under the rate limit should presumably be enough to keep up with future games.

1 Like

A database dump is neither quick nor easy to do; we use a Cassandra cluster to store all the game state across several servers.

The JSON API is a lot less intensive and gives you the game data without the comments and all of that. Simply hit https://online-go.com/termination-api/game/<id> and you’ll see the JSON. You can hit that endpoint faster, though I’m not quite sure what the practical limit is. I’d say start at 20/s and see how that goes. Please reach out to me as you’re doing it so I can monitor the system and let you know whether you can go higher or need to scale back.
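For anyone following along, here is a minimal sketch of pacing requests to roughly 20/s. Only the URL comes from the post above; the helper names (`game_url`, `throttled`) and the injectable `clock`/`sleep` parameters are made up for illustration, and no request is actually sent here.

```python
import time
from typing import Callable, Iterable, Iterator

# Endpoint quoted in the post above; everything else here is illustrative.
BASE_URL = "https://online-go.com/termination-api/game/"


def game_url(game_id: int) -> str:
    """Build the JSON URL for one game id."""
    return f"{BASE_URL}{game_id}"


def throttled(ids: Iterable[int], per_second: float,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[int]:
    """Yield ids no faster than `per_second`, sleeping between them as needed."""
    interval = 1.0 / per_second
    next_slot = clock()
    for game_id in ids:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        # Schedule the next slot from whichever is later: the plan or reality.
        next_slot = max(next_slot, clock()) + interval
        yield game_id
```

A caller would loop over `throttled(range(start, stop), 20)` and request `game_url(game_id)` for each id. Passing a fake `clock` and `sleep` makes the pacing testable without waiting.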

5 Likes

I was getting significantly rate limited at the first endpoint (/api/v1/game/) already – around 2/s. I tried before I reached out. To clarify, when I say “rate-limited” I mean 429 errors, not any kind of throttling.

@anoek I will try the new endpoint. I’m seeing it take 5s to respond (vs 0.2s for /api/v1/game/). Are you sure this is less intensive?

Would you like me to do things in parallel to get up to 20/s? My user agent includes “za3k” if you want to add any kind of rate limit exception.

One thing both are missing (vs the SGF export) is comments (i.e. chat), which are nice-to-have. Otherwise I would guess we could generate nearly identical SGF files, yes.

Also as a note, SGF export fails for games before a certain point, with 500 errors. I can give you the cutoff if you like, would just need a quick bisect. Edit: Cutoff is around 1200, very very early apparently. The error: “Cannot create property ‘speed’ on string ‘fischer’”
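The "quick bisect" mentioned here could look roughly like the sketch below. It assumes export fails for every id up to some cutoff and works for every id after it, which may not hold exactly if some ids are missing. `export_ok` is a hypothetical callable that would try the SGF export for one id and report whether it succeeded.

```python
from typing import Callable


def first_working_id(lo: int, hi: int, export_ok: Callable[[int], bool]) -> int:
    """Return the smallest id in (lo, hi] for which export_ok(id) is True.

    Assumes export fails at `lo`, works at `hi`, and switches exactly once
    in between (an assumption, not something the server guarantees).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if export_ok(mid):
            hi = mid  # cutoff is at or below mid
        else:
            lo = mid  # cutoff is above mid
    return hi
```

Searching a range of 100,000 ids this way needs at most 17 export attempts, so it stays well within a 1/s limit.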

I imagine it’s probably better not to have the chat though. While I understand that when the game is public, the chat is of course available to anyone with enough patience to find the games that have the chat and read them, one can imagine some users taking exception to someone downloading all their sgfs including any conversations they had with their opponents.

For example, there’s no real system in place for a person to opt out of such a data dump, right? Even if such a person were to request that their account be anonymised after their games had already been scraped.

2 Likes

Yes, I agree with the chat being excluded, seems like a terrible idea to include it.

1 Like

Yes, you’ll want to do it in parallel.

Games that don’t exist (never started, or matches never accepted, for instance) will take 5s to time out. That’s just a byproduct of how the underlying race-condition avoidance code works.
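A rough sketch of why parallelism matters here: if a missing game holds a connection for 5 s, then sustaining a target rate needs about rate × latency requests in flight (20/s × 5 s = 100 in the worst case where every id is missing). The names `workers_needed` and `fetch_many` are made up, and `fetch` is a hypothetical callable that returns the parsed JSON, or `None` for a game that does not exist.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def workers_needed(per_second: float, latency_s: float) -> int:
    """Requests that must be in flight to sustain a rate at a given latency."""
    return math.ceil(per_second * latency_s)


def fetch_many(ids: Iterable[int], fetch: Callable[[int], Optional[dict]],
               workers: int = 32) -> Dict[int, Optional[dict]]:
    """Run `fetch` over `ids` on a thread pool; None marks a missing game."""
    id_list = list(ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fetch, id_list))  # results keep input order
    return dict(zip(id_list, results))
```

Threads are a reasonable fit because the work is almost entirely waiting on the network. Any overall rate cap still has to be enforced separately, since the pool only bounds concurrency.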

Looks like you were right about missing games, and I just picked a bad test case.

@anoek OK, I’ll be ramping things up now. I set the rate limit to 20/s. If you encounter any issues let me know. It seems like we’re on pretty different sleep schedules, so feel free to block my IP or user agent in an emergency. If that’s fine, I’ll try doubling things tomorrow.

P.S. This endpoint returns a 400 error which should probably be a 404.

Ramping up to 40/s now. If there are no problems I’ll try doubling tomorrow.

I immediately start getting 429 exceptions at even 30/s. I’ll keep it to 20/s for now.

If that’s something you could raise (it’s not causing real problems) that would be convenient.
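One common way to cope with 429 responses in the meantime is to retry with a growing delay. The sketch below is only an illustration: `fetch` is a hypothetical callable returning `(status, body)`, and the delays are arbitrary starting values, not anything OGS recommends.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(fetch: Callable[[int], Tuple[int, bytes]], game_id: int,
                       max_tries: int = 5, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep
                       ) -> Tuple[int, bytes]:
    """Call fetch(game_id); on HTTP 429, wait and retry with doubling delays."""
    delay = base_delay
    for attempt in range(max_tries):
        status, body = fetch(game_id)
        if status != 429:
            return status, body
        if attempt < max_tries - 1:
            sleep(delay)  # back off before the next attempt
            delay *= 2
    return status, body  # still 429 after max_tries attempts
```

The caller can then decide what to do with an id that is still rate limited, for example putting it back on the queue for later.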

Current eta would be late September, already not that bad.
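For a feel of how the rate drives the ETA, the arithmetic is just ids divided by rate. The 36,000,000 figure in the comments is a hypothetical round number for illustration, not a count taken from OGS.

```python
def eta_days(remaining_ids: int, per_second: float) -> float:
    """Days needed to request `remaining_ids` ids at a steady rate."""
    seconds = remaining_ids / per_second
    return seconds / 86_400  # seconds per day


# Hypothetical example: 36,000,000 ids left to fetch.
# eta_days(36_000_000, 1)  is about 417 days
# eta_days(36_000_000, 20) is about 21 days
```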

What an interesting conversation

1 Like

@za3k
Looks like you have gone ahead and gotten things rolling, nice! Anyways, if you haven’t already written your own, here is some Python I wrote just now that converts JSONs to SGFs:

Also included there is a bash script that downloads some jsons and sgfs and performs a whitespace-ignoring word-level diff on the results by calling out to git, to test it out.
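The bash script itself isn't reproduced in this thread. As a stand-in, here is a small Python sketch of the same idea: compare two SGF strings token by token while ignoring whitespace between tokens. It uses `difflib` rather than git, and the tokenizer is a simplification that assumes well-formed SGF.

```python
import difflib
import re
from typing import List

# Parens, semicolons, property names, and bracketed values (with \] escapes).
TOKEN = re.compile(r"[();]|[A-Za-z]+|\[(?:\\.|[^\]\\])*\]", re.S)


def word_diff(a: str, b: str) -> List[str]:
    """List token-level differences between two SGF texts.

    Whitespace between tokens is ignored; whitespace inside [values] is kept.
    """
    tokens_a, tokens_b = TOKEN.findall(a), TOKEN.findall(b)
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed = " ".join(tokens_a[i1:i2])
            added = " ".join(tokens_b[j1:j2])
            out.append(f"-{removed} +{added}")
    return out
```

An empty list means the two files agree apart from layout, which is the check wanted when one SGF is on a single line and the other is spread over many.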

  • The ranks are not correct in some of these test cases, and I don’t know how to get them to be correct. Maybe it has something to do with OGS’s past rank system overhauls? Some are correct, but some of them are very wrong, like Leela Zero having a rank of 5k.
  • I took the liberty of making the sgf report wins by 0 points as “Jigo” according to normal SGF conventions, and to report wins by moderator decision as wins by “F” (forfeit) which is also a game end condition generally seen in SGFs. I don’t know if there’s a different way to report this that would be better, and I also don’t know if there are other win/loss conditions that need to be handled.
  • Unlike OGS, this converter produces an SGF that is all on one line and has no unnecessary nesting of moves. It’s slightly shorter, and maybe slightly nicer to naive SGF parsers.
  • The SGF doesn’t have comments since comments aren’t in the json, as discussed above.
  • I took the liberty of including a tiny bit of extra info using the “GC” (game comment) tag, which is not present in OGS’s SGFs: namely, whether the game was ranked, and OGS’s description of the game speed (e.g. “blitz”, “live”, “correspondence”). These should be useful to have for anyone using this as a dataset.
  • I wanted to try to also include info on whether each player was a bot or not, since OGS has that information on its site, but it doesn’t seem to be present in the json itself, so I wasn’t able to.
  • Some time controls and rules and game results and other game options are very rare. I’m not entirely sure of handling all the corner cases correctly. (Also forks? forks that start on white’s turn? free vs fixed placement handicap? unusual-sized boards? other things I’m forgetting about? UTC or some other time zone for converting time stamps to dates?). The game_ids currently in the test script are nowhere near enough coverage, they’re just what I was debugging on by hand as I was writing the code.
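To make the points above concrete, here is a simplified sketch of a one-line SGF writer. It is not the converter from this post. The input shape is an assumption: moves as zero-based `(x, y, ...)` tuples with negative coordinates meaning a pass, and a result described by made-up labels (`score`, `resign`, `time`, `forfeit`). It ignores handicap, forks, and the other corner cases listed above. For a drawn game it emits `0`, which is the FF[4] spelling for a draw.

```python
from typing import Sequence

SGF_LETTERS = "abcdefghijklmnopqrstuvwxyz"


def escape(text: str) -> str:
    """Escape backslashes and closing brackets inside an SGF value."""
    return text.replace("\\", "\\\\").replace("]", "\\]")


def sgf_point(x: int, y: int) -> str:
    """SGF point for zero-based (x, y); negative coordinates mean a pass."""
    if x < 0 or y < 0:
        return ""  # FF[4] writes a pass as an empty value
    return SGF_LETTERS[x] + SGF_LETTERS[y]


def sgf_result(winner: str, reason: str, margin: float = 0.0) -> str:
    """RE value; `reason` is one of the hypothetical labels named above."""
    if reason == "score":
        if margin == 0:
            return "0"  # drawn game (jigo)
        return f"{winner}+{margin:g}"
    suffix = {"resign": "R", "time": "T", "forfeit": "F"}[reason]
    return f"{winner}+{suffix}"


def to_sgf(size: int, black: str, white: str, moves: Sequence[Sequence[int]],
           result: str, ranked: bool, speed: str) -> str:
    """Emit a one-line SGF with a flat move list, assuming black moves first."""
    comment = f"{'ranked' if ranked else 'unranked'}, {speed}"
    header = (f"(;GM[1]FF[4]SZ[{size}]PB[{escape(black)}]PW[{escape(white)}]"
              f"RE[{result}]GC[{escape(comment)}]")
    body = "".join(f";{'BW'[i % 2]}[{sgf_point(move[0], move[1])}]"
                   for i, move in enumerate(moves))
    return header + body + ")"
```

Keeping the result mapping in its own function makes it easy to extend once the rarer outcomes turn up in the real data.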

Anyways, hopefully this is already almost correct, and close to something usable for converting the whole mass of jsons when you have them. Feel free to suggest fixes or whatever.

1 Like

Thanks! I was waiting to see if you got to it this weekend before I started on anything. Looks good, but I don’t know the SGF format well myself. I’d suggest adding some error handling for if we can’t convert JSON, making a note of it and moving on. I can add this.

Currently I have all the JSON in a single stream for speed. Opening and closing millions of files is pretty slow. I’ll add some custom logic to deal with this and generate SGF files with reasonable names etc. I’ll measure how fast this can process things, too. I assume you’re fine with my including your generator source in the final output?
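A sketch of what that could look like: read the stream one record at a time, convert, and on any failure note it and move on. This assumes one JSON object per line, which is a guess about the stream format. `convert` stands for whatever JSON-to-SGF function ends up being used.

```python
import json
from typing import Callable, Iterable, Iterator, List, Tuple


def convert_stream(lines: Iterable[str], convert: Callable[[dict], str],
                   failures: List[Tuple[int, str]]) -> Iterator[str]:
    """Yield one SGF string per JSON line.

    On failure, append (line number, error) to `failures` and keep going.
    """
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            yield convert(json.loads(line))
        except Exception as exc:  # deliberately broad: log and move on
            failures.append((lineno, repr(exc)))
```

Because it is a generator, an open file object can be passed straight in and memory use stays flat regardless of how many games the stream holds.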

Yep, go ahead, do whatever you want with it. I also might spend a bit more time later to see if I can fix the highly bogus ranks (one way or another, this needs a fix before using this for real), and maybe test out a few more corner cases.

You could search the forums for issues with ranks. There could be a few possible causes, from the fact that the rank is 30-x, where you get x from the rating conversion, to the fact that there’s a bunch of different rank numbers in the JSONs (I’m thinking of the older api/v1 ones, which probably aren’t the ones in use).
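As a sketch of the "30-x" idea: if x is the numeric rank value from the rating conversion, the kyu label is 30 minus x. The dan side (30 and above mapping to 1d, 2d, ...) is an assumption here, as is flooring fractional values, so check it against known players before relying on it.

```python
import math


def rank_label(rank: float) -> str:
    """Format a numeric rank, assuming 29 maps to 1k and 30 maps to 1d."""
    r = math.floor(rank)
    if r < 30:
        return f"{30 - r}k"
    return f"{r - 29}d"
```

If a strong bot comes out as 5k with this, the number being fed in is probably not the field that holds the current rank.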

@Lys, for example, ran into a bunch of issues with funny rating jumps when making graphs for one of the huge tournaments, to see how many games finished etc.

1 Like

There are some events going on this weekend. I’ll up the limit at some point after them, @za3k.

3 Likes

@za3k Are you still pursuing this? I’d love to get a copy. I’m also happy to share what I have so far, which is both the metadata and generated SGF in the range 31,000,000 - 35,999,999.

3 Likes