Can we get an SGF database dump?

@voltrevo Yes. The JSON download will be complete around the end of September or early October. We’re not doing a generated SGF download at all, to save OGS-side resources. I can make the raw JSON results available early, I suppose. I grabbed 0-18M or so thus far.

Once the download is complete, I will do the local SGF conversion, check the results, and clean everything up (including categorizing by whether players are bots, adding documentation, etc). Then I’ll make the JSON+SGF results available in several public places. You’ll see a post here when that happens.

1 Like

@za3k Awesome. Happy to wait for the full dataset.

I have a general question about the copyright status and fair use of go games hosted on OGS.

From my understanding, the assumption is that an individual go game (the move sequence) can’t be copyrighted, but a collection of them could be. So if I want to use the database, once finished, in a research paper that is online and open access, do I need to get consent from the OGS team? Or is a general fair use disclaimer enough?

Also, since publishing papers is always time sensitive, I’ve asked users themselves to download and share their game records individually (fewer than 100 records per person). Do I still need to notify the OGS team when I analyze these for the users themselves (such as their views of common moves or play style, for a research project)? This shouldn’t violate any of the current OGS privacy policy, right? If this can’t be answered easily, who should I PM?

As far as whatever OGS’s claim might be on games or game collections played on the server, we’re fine with them being used for research.

4 Likes

@anoek, would you be okay with explicitly putting things in the public domain or under some open source license? I almost guarantee open-source AI developers (as opposed to academics) will want that. I’m happy to answer any legal questions to the best of my (not-a-lawyer) ability.

Edit: Also sorry I should have said first, thanks for saying you’re fine with them being used for research! That’s still a huge useful step on its own.

2 Likes

Without consulting a lawyer I don’t know that that’s something I can simply do or not, so no. The best I can do is say that if it is in fact solely up to us, we don’t mind folks using the public game records played on OGS for research (so long as said research is respectful of the players). Whatever other legal precedents there may or may not be on the matter in the courts around the world, I can’t speak to, so I leave navigating that as an exercise to the reader. (I am not aware of any notable pitfalls or gotchas fwiw, I just can’t give you any guarantees on the matter).

5 Likes

ETA is still about a week, but here’s some preliminary information about status codes which could be interesting for @anoek .

frequency status_code example_url
21,272,237 200 https://online-go.com/termination-api/game/28511299
4,360,504 404 https://online-go.com/termination-api/game/28511298
2,001,434 403 https://online-go.com/termination-api/game/28511281
876,042 400 https://online-go.com/termination-api/game/28353476
819 520 https://online-go.com/termination-api/game/28461434
230 502 https://online-go.com/termination-api/game/26641560
23 409 https://online-go.com/termination-api/game/23951731
11 530 https://online-go.com/termination-api/game/10778187
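For a sense of proportion, the table above can be tallied with a few lines of Python. This is only a rough sketch: the counts are copied from the table, and the names `STATUS_COUNTS` and `status_shares` are made up for illustration.

```python
# Counts copied from the status-code table above.
STATUS_COUNTS = {
    200: 21_272_237,
    404: 4_360_504,
    403: 2_001_434,
    400: 876_042,
    520: 819,
    502: 230,
    409: 23,
    530: 11,
}


def status_shares(counts):
    """Return (total, {status_code: percentage of all requests})."""
    total = sum(counts.values())
    shares = {code: 100.0 * n / total for code, n in counts.items()}
    return total, shares


total, shares = status_shares(STATUS_COUNTS)
print(f"total requests: {total:,}")
# Print the most common status codes first.
for code, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{code}: {pct:.4f}%")
```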

Also gotcha on the license, that’s understandable. I am fairly sure (in the USA) it’s in the same boat of “OGS can do whatever it wants, but they might turn out not to have had copyright in the first place once it goes to court”. Outside the US I don’t have enough expertise to say.

Preliminary download results are available here in JSON format. I’m currently fixing the missed games from the first pass. After I get things fixed, I’ll generate SGF files and post everything elsewhere.

@hexahedron if you want to improve the ranks in your script, now’s the time. Otherwise there’s a 70% chance they’ll just go up with slightly off ranks; I don’t expect to do much better than you.

2 Likes

This couldn’t have come at a better time. I only have a much smaller set of user self-reported games for my preliminary results, and the lowest-level pattern tokens might be biased toward that limited pool. The full dataset can make long-term trend analysis easier.

I do find in my smaller database that the win/loss results can’t be trusted completely. Users themselves report that a small portion of their wins/losses don’t match the records, for example a player makes the last move but is recorded as losing by resignation. When I examine them, in some the last move clearly kills a local group and is not a self-atari, which is very weird (maybe some unknown bugs?), some are probably unreported sandbagging/airbagging, and others come from other factors (an opponent feeling the strength difference is too high and not wanting to continue, teaching games, personal matters, etc.), of which only a handful can be determined from metadata like the chat history, game names, opponent strength history, etc. For now, my option is to scrub them manually and mark them out of the training data (or use them in validation/testing where win/loss is not a feature). For a full dataset of this size, the scrub clearly can’t be done manually. Some preprocessing is going to be needed.
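As one possible first preprocessing pass, here is a minimal sketch that flags games where the player recorded as resigning also made the final move. It assumes the records are SGF text with a standard `RE` result property (`B+R` / `W+Resign` style) and plain `;B[..]` / `;W[..]` move nodes; the function name `loser_moved_last` is hypothetical, and the regexes are a shortcut rather than a real SGF parser (they can be fooled by comments containing move-like text). A flagged game is only a candidate for review, not proof of a bad result.

```python
import re

# SGF move nodes look like ";B[dd]" or ";W[pp]"; an empty "[]" is a pass.
MOVE_RE = re.compile(r";\s*([BW])\[([a-z]{0,2})\]")
# RE holds the result; "B+R" or "B+Resign" means Black won by resignation.
RESULT_RE = re.compile(r"RE\[([BW])\+(?:R|Resign)\]")


def loser_moved_last(sgf_text):
    """Return True if the game ended by resignation and the resigning
    player also played the final move, False if the winner moved last,
    and None if the game did not end by resignation or has no moves."""
    result = RESULT_RE.search(sgf_text)
    moves = MOVE_RE.findall(sgf_text)
    if result is None or not moves:
        return None
    winner = result.group(1)
    last_mover = moves[-1][0]
    return last_mover != winner


# Tiny hand-written records, for illustration only.
normal = "(;GM[1]SZ[19]RE[B+R];B[pd];W[dp];B[pp])"
suspicious = "(;GM[1]SZ[19]RE[B+R];B[pd];W[dp];B[pp];W[dd])"
scored = "(;GM[1]SZ[19]RE[W+3.5];B[pd];W[dp])"

print(loser_moved_last(normal))      # winner moved last
print(loser_moved_last(suspicious))  # loser moved last, then resigned
print(loser_moved_last(scored))      # not a resignation
```

Games that come back `True` could then be excluded from training or routed to validation, as described above.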

P.S. You will be using this as the basis for the conversion to SGF format, right? What game name would be encoded?

OK, the SGF conversion is complete. All games are available for download as SGF files or as JSON files. My current organization is to have two downloads available–one organized by date, and one by username. The SGF collection is about 11GB compressed.

Current hosting is at za3k - OGS Go game collection on my home server. I will be waiting for some feedback that things look OK, and if everything looks fine I’ll upload to Internet Archive, make a torrent etc.

I will go ahead and edit the top post so people can quickly find the link, too. Edit: Nope, can’t edit it after a reply, looks like.

4 Likes

Edit access time varies according to your “level” as a forum user.
Maybe you can just start a new topic? (And link this one for reference)

I can likely edit the post for you if you want to let me know what to include. The link above anyway?

1 Like

Thank you shinuito. Yes, just adding that would be fine for now.

I’ve added the first paragraph and next sentence with the link as that seems like a lot of useful info :slight_smile:

OK, I uploaded it to Internet Archive as well, since realistically I don’t expect feedback, and my little server was struggling with all the downloads :). OGS 2021 collection of Go games : Zachary Vance / OGS : Free Download, Borrow, and Streaming : Internet Archive

shinuito, if you decide to update it again might I suggest putting the links at the top of the post instead of the bottom?

4 Likes

I haven’t followed the topic, so a short question: why 2013 and later only? Why not include historic games from 2005-2013? There aren’t many of them anyway.

I’m referring to the line:

  • Games are included from 2013-01-29 (start of OGS) to 2021-08-29

1 Like

Thanks for the correction, 2005-11-05 is indeed where games start from. I’ll fix the documentation.

My error was probably from game ID #16, which is dated in 2013–IDs are not always in chronological order, it looks like.

1 Like

Yep, because of the novags and old OGS merge.

2 Likes

How many games are played or ended every day?

Will you update your database with new games?

3 Likes

I do not plan to update this more than once a year, if that.

You can download the collection and answer your other question for yourself.
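If it helps, here is a rough sketch of how that count could be done. It assumes each SGF record carries a standard `DT` property in `YYYY-MM-DD` form, which I haven't stated anywhere in the docs, so check a few files first; `games_per_day` is just an illustrative name, and the sample records are made up. Note that `DT` is the date the game was played, which is not necessarily the day it ended.

```python
import re
from collections import Counter

# DT is the standard SGF date property, e.g. DT[2021-08-29].
DATE_RE = re.compile(r"DT\[(\d{4}-\d{2}-\d{2})")


def games_per_day(sgf_texts):
    """Count games per calendar day from an iterable of SGF strings.
    Records without a recognizable DT property are skipped."""
    counts = Counter()
    for text in sgf_texts:
        match = DATE_RE.search(text)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample records, for illustration only.
samples = [
    "(;GM[1]DT[2021-08-28]RE[B+R];B[pd];W[dp])",
    "(;GM[1]DT[2021-08-28]RE[W+2.5];B[dd];W[pp])",
    "(;GM[1]DT[2021-08-29]RE[B+R];B[pd])",
    "(;GM[1]RE[B+R];B[pd])",  # no date, skipped
]
per_day = games_per_day(samples)
for day, n in sorted(per_day.items()):
    print(day, n)
```

For the real collection you would feed it the contents of the downloaded files instead of `samples`.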

2 Likes