Has anyone updated the OGS game dump since 2021? If not, then I’d be happy to catch it up from where it was left off.
Was the approach to hit the termination API for each game ID in sequence and save the resulting JSON objects (for all the ones that are publicly accessible)? So I should find the largest ID from the 2021 dump and just run a huge for loop from that ID onwards?
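A minimal sketch of the loop described above, under stated assumptions: `fetch` is a hypothetical callable that takes a game ID and returns the game's JSON as a dict, or `None` when the game is not publicly accessible. The real endpoint URL and response shape are not spelled out in this thread, so they are deliberately left out. `crawl_range` and `save_jsonl` are illustrative names, not part of any existing dump tooling.

```python
import json
from typing import Callable, Iterable, Iterator, Optional, Tuple


def crawl_range(
    start_id: int,
    stop_id: int,
    fetch: Callable[[int], Optional[dict]],
) -> Iterator[Tuple[int, dict]]:
    """Yield (game_id, game_json) for each ID in [start_id, stop_id) that fetch returns.

    `fetch` is a hypothetical helper: it should return the parsed JSON for a
    game, or None when the game is missing or not public.
    """
    for game_id in range(start_id, stop_id):
        game = fetch(game_id)
        if game is not None:
            yield game_id, game


def save_jsonl(records: Iterable[Tuple[int, dict]], path: str) -> int:
    """Append one JSON object per line to `path`; return how many were written."""
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for game_id, game in records:
            out.write(json.dumps({"id": game_id, "game": game}) + "\n")
            count += 1
    return count
```

Starting from the largest ID in the previous dump would then be `save_jsonl(crawl_range(last_id + 1, stop_id, fetch), "games.jsonl")`, with `fetch` wrapping whatever HTTP client and endpoint are agreed on.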
It would be nice to have another game dump (and an annual update in future years). Lots of scope for doing database searches to see how joseki and fuseki are evolving at “mere mortal” level as opposed to pro level.
I’ve started another download from the point @za3k left off, that is from game 36611000. It’s making a bit of progress at 20/s. FYI @anoek, LMK if it’s any trouble.
Actually, I have it updated to 2023, I just didn’t get around to publishing it. Sorry! I was having some issues (possibly false positives) post-processing the JSON into SGF collections, and then got distracted.
If you bug me again next week (I’m about to drive out of state ATM) I can update it. [of course feel free to download in parallel too] My email is za3k@za3k.com if you want to coordinate.
@anoek, if I can hijack the thread a bit… is there any documentation on what “rank” means, in terms of displaying a kyu/dan rating? And is rank always guaranteed to be at the time of the game played?
Thanks anoek. I thought I was seeing something else, but probably I was mistaken. Will check when I’m back in state.
@siimphh I’ve resumed downloading from ID=56,130,746. I’ll post the JSON once I get it. ETA is maybe June 7. (That covers a first download pass, plus a second pass for any games that failed the first time around.) I’ll upload the JSON immediately this time so you have a copy, and only later deal with the SGF generation for the general public.
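The two-pass idea mentioned above could look roughly like this. It is only a sketch: `fetch` is a hypothetical callable that returns a game's JSON as a dict or raises an exception on failure, and the function name is made up for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def download_with_second_pass(
    ids: Iterable[int],
    fetch: Callable[[int], dict],
) -> Tuple[Dict[int, dict], List[int]]:
    """Fetch every ID once, then retry the failures exactly once more.

    Returns (results, still_failed). `fetch` is a hypothetical helper that
    raises on any failure.
    """
    results: Dict[int, dict] = {}
    failed: List[int] = []

    # First pass: record failures instead of stopping the crawl.
    for game_id in ids:
        try:
            results[game_id] = fetch(game_id)
        except Exception:
            failed.append(game_id)

    # Second pass: one more attempt for each game that failed.
    still_failed: List[int] = []
    for game_id in failed:
        try:
            results[game_id] = fetch(game_id)
        except Exception:
            still_failed.append(game_id)

    return results, still_failed
```

Keeping `still_failed` around makes it easy to publish the list of IDs that never came back alongside the JSON.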
Hehe, my download had gotten as far as 53518706, but I’ve stopped it for now. I’ll wait for yours to finish and grab it when it’s ready! Thanks for picking it up again!
For my purposes, I’m definitely more interested in the raw JSON because I’m fishing out some extra pieces of metadata that don’t typically end up in SGF (ranked vs unranked and broad game clock categories; I’m not using the final board counting result yet, but that’s also interesting). So yeah, if you can make a JSON download available when it’s ready then I’d appreciate it ahead of SGF, even though maybe more people would be looking at the SGF files in general.
Oh shoot! I’ve started mine up as well, maybe it will also be useful to compare notes later about which games are found and which ones aren’t. So far I had seen stretches of game ID ranges that all return 400, and some individual 409 if I recall correctly. Other errors seem to be retriable.
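For comparing notes, something like the following might help. It is a sketch based only on the behaviour reported above (400 and 409 looked permanent, other errors looked retriable); what those status codes actually mean on the server side is not documented in this thread, and both function names are illustrative.

```python
from typing import Iterable, List, Tuple


def classify(status: int) -> str:
    """Map an HTTP status to an action, following the observations above.

    Assumption: 400 and 409 are treated as permanent and skipped; any other
    non-200 status is treated as worth retrying.
    """
    if status == 200:
        return "ok"
    if status in (400, 409):
        return "skip"
    return "retry"


def to_ranges(ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse game IDs into inclusive (first, last) runs of consecutive IDs."""
    ranges: List[Tuple[int, int]] = []
    for game_id in sorted(set(ids)):
        if ranges and game_id == ranges[-1][1] + 1:
            # Extends the current run by one.
            ranges[-1] = (ranges[-1][0], game_id)
        else:
            # Gap found: start a new run.
            ranges.append((game_id, game_id))
    return ranges
```

Two crawlers could each run `to_ranges` over their skipped IDs and diff the outputs, which is much shorter than diffing raw ID lists when whole stretches are missing.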
I’d also set up writing files by game start time and was very surprised to find occasional games with wildly out of sequence start times, by months or even years.
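A sketch of bucketing by start time and flagging the out-of-sequence games, assuming each game carries a start time as Unix seconds (the exact field name in the JSON is not given in this thread, so the functions take the timestamp directly). The 30-day tolerance is an arbitrary placeholder.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple


def month_bucket(start_time: float) -> str:
    """Return a 'YYYY-MM' bucket name (UTC) for a Unix timestamp in seconds."""
    return datetime.fromtimestamp(start_time, tz=timezone.utc).strftime("%Y-%m")


def out_of_sequence(
    games: Iterable[Tuple[int, float]],
    tolerance_seconds: float = 30 * 86400,  # placeholder threshold
) -> List[int]:
    """Flag IDs whose start time is far earlier than the latest seen so far.

    `games` is (game_id, start_time) in ascending ID order. Only games that
    are too OLD for their position are flagged; a single game with a
    far-future timestamp would skew the running maximum.
    """
    latest: Optional[float] = None
    flagged: List[int] = []
    for game_id, start_time in games:
        if latest is not None and start_time < latest - tolerance_seconds:
            flagged.append(game_id)
        else:
            latest = start_time if latest is None else max(latest, start_time)
    return flagged
```

Flagged games can still be written to the bucket matching their own start time; the list just makes the oddities easy to inspect later.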
@anoek has requested that siimphh and I both stop crawling for the time being, because the servers are getting too loaded. I’ve stopped on my end.
@siimphh, let’s make sure at most one of us is crawling OGS at a time once we restart, to avoid the combined load of two crawlers. I’ll let you finish everything you want to do first.
Please coordinate with @anoek to figure out how to crawl in a reasonable fashion (endpoint, rate, monitoring to make sure nothing is overloaded, and so on).
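Whatever rate gets agreed, a crawler can enforce it with a small throttle along these lines. This is a sketch only: the `Throttle` class is hypothetical, and the acceptable requests-per-second figure has to come from @anoek, not from this example.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space out calls so that at most `max_per_second` happen per second."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve the slot after it."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling `throttle.wait()` before every request keeps a single crawler under the agreed limit. It does nothing to coordinate two separate crawlers, so the "one of us at a time" rule still applies.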
I’ll keep an eye on this thread and any DMs. Until I hear a go-ahead from both of you, my crawler is stopped.