@voltrevo Yes. The JSON download will be complete around the end of September or early October. We’re not doing a generated SGF download at all, to save OGS-side resources. I can make the raw JSON results available early, I suppose. I grabbed 0-18M or so thus far.
Once the download is complete, I will do the local SGF conversion, check the results, and clean everything up (including categorizing by whether players are bots, adding documentation, etc). Then I’ll make the JSON+SGF results available in several public places. You’ll see a post here when that happens.
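For anyone curious what the local conversion step roughly involves, here is a minimal sketch. It is not the actual conversion script. The record layout (`width`, `height`, `black`, `white`, and `moves` as zero-based `[x, y]` pairs with negative coordinates meaning a pass) is a hypothetical simplification, and it assumes no handicap stones and boards no larger than 26x26.

```python
def sgf_escape(text):
    """Escape the characters that are special inside an SGF property value."""
    return text.replace("\\", "\\\\").replace("]", "\\]")


def coord(x, y):
    """Convert zero-based (x, y) to SGF letters; a negative value is treated as a pass."""
    if x < 0 or y < 0:
        return ""  # SGF FF[4] writes a pass as an empty value
    return chr(ord("a") + x) + chr(ord("a") + y)


def game_to_sgf(game):
    """Build a single-line SGF string from a (hypothetical) game record dict."""
    w, h = game["width"], game["height"]
    size = str(w) if w == h else f"{w}:{h}"
    header = (
        f"(;GM[1]FF[4]SZ[{size}]"
        f"PB[{sgf_escape(game['black'])}]PW[{sgf_escape(game['white'])}]"
    )
    nodes = []
    for i, (x, y) in enumerate(game["moves"]):
        color = "B" if i % 2 == 0 else "W"  # assumes Black moves first, strict alternation
        nodes.append(f";{color}[{coord(x, y)}]")
    return header + "".join(nodes) + ")"


example = {
    "width": 19,
    "height": 19,
    "black": "alice",
    "white": "bob",
    "moves": [[15, 3], [3, 15], [-1, -1]],
}
print(game_to_sgf(example))
```

A real converter would also need to handle handicap placement, komi, rules, results, and whatever the raw JSON actually calls these fields.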
I have a general question about the copyright status and fair use of go games hosted on OGS.
From my understanding, there is an assumption that an individual Go game (the move sequence) can’t be copyrighted, but a collection of them could be. So if I want to use the database, once finished, in a research paper that is online and open access, do I need to get consent from the OGS team? Or is a general fair use disclaimer enough?
Also, since publishing papers is always time sensitive, I’ve asked the users themselves to download and share their game records individually (fewer than 100 records per person). Do I still need to notify the OGS team when I analyze them for the users themselves (like their views of common moves or play style for a research project)? This shouldn’t violate any of the current OGS privacy policy, right? If this can’t be answered easily, who should I PM?
@anoek, would you be okay with explicitly putting things in the public domain or under some open source license? I can almost guarantee open-source AI developers (as opposed to academics) will want that. I’m happy to answer any legal questions to the best of my (not-a-lawyer) ability.
Edit: Also sorry I should have said first, thanks for saying you’re fine with them being used for research! That’s still a huge useful step on its own.
Without consulting a lawyer I don’t know whether that’s something I can simply do or not, so no. The best I can do is say that if it is in fact solely up to us, we don’t mind folks using the public game records played on OGS for research (so long as said research is respectful of the players). Whatever other legal precedents there may or may not be on the matter in the courts around the world, I can’t speak to, so I leave navigating that as an exercise to the reader. (I am not aware of any notable pitfalls or gotchas fwiw, I just can’t give you any guarantees on the matter).
Also gotcha on the license, that’s understandable. I am fairly sure (in the USA) it’s in the same boat of “OGS can do whatever it wants, but they might turn out not to have had copyright in the first place once it goes to court”. Outside the US I don’t have enough expertise to say.
Preliminary download results are available here in JSON format. I’m currently fixing the missed games from the first pass. After I get things fixed, I’ll generate SGF files and post everything elsewhere.
@hexahedron if you want to improve the ranks in your script, now’s the time. Otherwise 70% chance they’ll just be going up with slightly off ranks, I don’t expect to do much better than you.
This couldn’t have come at a better time. I only have a much smaller set of user self-reported games for my preliminary results, and the lowest-level pattern tokens might be biased toward that limited pool. This can make long-term trend analysis easier.
I do find in my smaller database that the win/loss results can’t be trusted completely. Users themselves report that a small portion of their wins/losses don’t match the records, e.g. a player makes the last move but is recorded as losing by resignation. When I examine them, some clearly show the player killing a local group rather than a self-atari, which is very weird (maybe some unknown bug?); some are probably unreported sandbagging/airbagging; and others come from other factors (the opponent feeling the strength difference is too high, not wanting to continue, teaching games, personal matters, etc.), only a handful of which can be determined from metadata like the chat history, game names, opponent strength history, etc. For now, my option is to scrub them manually and exclude them from the training data (or use them in validation/testing where win/loss is not a feature). For a full dataset of this size, the scrubbing clearly can’t be done manually. Some preprocessing is going to be needed.
P.S. You will be using this as the basis for the conversion to SGF format, right? What would be encoded as the game name?
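A rough sketch of the kind of automatic pre-filter I have in mind, flagging games where the player who made the last move is recorded as losing by resignation. The `GameRecord` fields and the `"Resignation"` outcome string are hypothetical names for illustration, not the actual OGS schema, and the handicap handling (White moves first when there are handicap stones) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class GameRecord:
    # All field names here are illustrative assumptions.
    game_id: int
    move_count: int    # moves played, passes included
    winner: str        # "black" or "white"
    outcome: str       # e.g. "Resignation", "Timeout", "7.5 points"
    handicap: int = 0  # assumed: White moves first when this is > 0


def last_mover(game):
    """Return the color that played the final move, assuming strict alternation."""
    first = "black" if game.handicap == 0 else "white"
    second = "white" if first == "black" else "black"
    return first if game.move_count % 2 == 1 else second


def is_suspicious(game):
    """True when the last mover is the one recorded as resigning."""
    if game.move_count == 0:
        return False
    resigned = game.outcome.strip().lower() == "resignation"
    return resigned and last_mover(game) != game.winner


def split_for_training(games):
    """Separate games into (trusted, held_out) lists by the heuristic above."""
    trusted = [g for g in games if not is_suspicious(g)]
    held_out = [g for g in games if is_suspicious(g)]
    return trusted, held_out


games = [
    GameRecord(1, 101, "white", "Resignation"),  # Black moved last, then resigned
    GameRecord(2, 100, "white", "Resignation"),  # White moved last, Black resigned
    GameRecord(3, 101, "white", "Timeout"),
]
trusted, held_out = split_for_training(games)
print([g.game_id for g in held_out])
```

This only catches the one pattern described above. The flagged games still need a manual look or further heuristics, since a late resignation can also be perfectly legitimate.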
OK, the SGF conversion is complete. All games are available for download as SGF files or as JSON files. My current organization is to have two downloads available–one organized by date, and one by username. The SGF collection is about 11GB compressed.
Current hosting is at za3k - OGS Go game collection on my home server. I will be waiting for some feedback that things look OK, and if everything looks fine I’ll upload to Internet Archive, make a torrent etc.
I will go ahead and edit the top post so people can quickly find the link, too. Edit: Nope, can’t edit it after a reply, looks like.