This post is purely informational, for those who care or are curious about networking things. There’s nothing in it anyone needs to know.
Over the past few weeks we’ve been conducting some networking experiments to make sure we are making the right choice when it comes to networking. I figured the results of those experiments might be of interest to some of you, so here they are:
What we are testing
The online-go.com servers are located in the Google Cloud Platform (GCP) US-East 1 data center. We also use Cloudflare to mitigate denial-of-service attacks as well as to reduce the cost of bandwidth.
When it comes to the “real time” features of the site, such as sending and receiving moves in a game, we have the choice of three different “routes” to get those packets to the data center and back to the browser.
- Via the GCP “standard” network, which is essentially completely publicly routed
- Via the GCP “premium” network, in which packets from your browser get routed through dedicated Google-operated lines
- Via the Cloudflare network, in which packets from your browser get routed through the Cloudflare network and are delivered directly to the data center via a peering agreement between Cloudflare and Google.
These options all have different prices, which differ based on total bandwidth used and destination country and whatnot, but roughly speaking GCP standard is $0.085 / GB, GCP premium is $0.12 / GB, and Cloudflare is $0.04 / GB.
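To make the price gap concrete, here is a small sketch that turns the rough per-GB figures above into a cost estimate and a price ratio. The route keys, the function names, and the 1000 GB volume are made up for illustration; only the three prices come from this post, and real pricing varies with volume and destination.

```typescript
// Approximate per-GB prices quoted in this post (USD). Real pricing varies
// by total bandwidth and destination country, so these are rough figures.
const PRICE_PER_GB = {
    "gcp-standard": 0.085,
    "gcp-premium": 0.12,
    "cloudflare": 0.04,
} as const;

type PricedRoute = keyof typeof PRICE_PER_GB;

// Estimated cost in USD for sending `gigabytes` of traffic over a route.
function estimateCost(route: PricedRoute, gigabytes: number): number {
    return PRICE_PER_GB[route] * gigabytes;
}

// Price of route `a` expressed as a fraction of the price of route `b`.
function priceRatio(a: PricedRoute, b: PricedRoute): number {
    return PRICE_PER_GB[a] / PRICE_PER_GB[b];
}

// Hypothetical traffic volume, purely for illustration.
const exampleGb = 1000;
for (const route of Object.keys(PRICE_PER_GB) as PricedRoute[]) {
    console.log(route, estimateCost(route, exampleGb).toFixed(2));
}
```

With these numbers, Cloudflare comes out at about 0.33 times the price of GCP premium and about 0.47 times the price of GCP standard.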
Cost-wise it’s an easy choice to use Cloudflare, and we have been for many years now. However, at some point I began to question whether one of the other options might provide a better experience and be worth the extra cost. So I had everyone who connected to OGS establish three WebSocket connections, one along each route. Every 10 seconds the client sent a “ping” along all three routes at the same time, then reported the latency observed on each.
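For the curious, the measurement idea can be sketched roughly like this. This is not the actual OGS client code: the `RouteProbe` class, the `pingAll` helper, and the JSON ping/pong message shape are all assumptions made up for illustration. The transport is passed in as a plain `send` function so the timing logic stays independent of any particular WebSocket setup.

```typescript
// Hypothetical sketch, not the actual OGS client code. The JSON message
// shape ({ type: "ping" | "pong", id }) is an assumption for illustration.
class RouteProbe {
    readonly route: string;
    readonly samplesMs: number[] = [];
    private pending = new Map<number, number>(); // ping id -> time sent (ms)
    private nextId = 0;
    private send: (data: string) => void;
    private now: () => number;

    constructor(
        route: string,
        send: (data: string) => void,
        now: () => number = () => performance.now(),
    ) {
        this.route = route;
        this.send = send;
        this.now = now;
    }

    // Record the send time and emit a ping carrying a unique id.
    ping(): void {
        const id = this.nextId++;
        this.pending.set(id, this.now());
        this.send(JSON.stringify({ type: "ping", id }));
    }

    // Feed every incoming message here; pongs are matched to their ping
    // and the round-trip time is stored as one latency sample.
    handleMessage(raw: string): void {
        const msg = JSON.parse(raw);
        if (msg.type !== "pong") return;
        const sentAt = this.pending.get(msg.id);
        if (sentAt === undefined) return; // unknown or duplicate pong
        this.pending.delete(msg.id);
        this.samplesMs.push(this.now() - sentAt);
    }
}

// Ping every route on the same tick so the samples are comparable.
function pingAll(probes: RouteProbe[]): void {
    for (const probe of probes) probe.ping();
}
```

In a browser, each probe would presumably be wired to its own WebSocket, with `send` calling `ws.send(...)`, the socket's `onmessage` handler forwarding `event.data` to `handleMessage`, and a `setInterval` calling `pingAll` every 10 seconds. The server would need to echo each ping back as a pong with the same id.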
At a glance, all three networks were pretty similar. The maps below show the average latency seen from each country over the past 7 days. Green indicates latencies under 150ms, yellow 150-250ms, orange 250-500ms, and red 500ms or more. Where a country was one color on one map and a different color on another, the averages were generally within about 10ms of each other, so the country just happened to sit on a color boundary. There were no “clear winners” that I saw, regardless of country.
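The color bucketing, and why a color difference between two maps can be misleading, can be illustrated with a short sketch. The function names are made up, and treating each boundary value (150, 250, 500) as belonging to the higher bucket is an assumption, since the ranges above don't say which side the edges fall on.

```typescript
type LatencyColor = "green" | "yellow" | "orange" | "red";

// Map an average latency in milliseconds to a map color.
// Assumption: boundary values fall into the higher (slower) bucket.
function latencyColor(averageMs: number): LatencyColor {
    if (averageMs < 150) return "green";
    if (averageMs < 250) return "yellow";
    if (averageMs < 500) return "orange";
    return "red";
}

// True when two averages get different colors even though they are within
// `toleranceMs` of each other, i.e. they merely straddle a color boundary.
function differOnlyByBoundary(aMs: number, bMs: number, toleranceMs = 10): boolean {
    return latencyColor(aMs) !== latencyColor(bMs) && Math.abs(aMs - bMs) <= toleranceMs;
}
```

For example, averages of 146ms and 153ms would show up as green and yellow respectively, even though they are only 7ms apart.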
I also inspected the 20th and 95th percentiles, as well as the median, of a few representative countries.
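As a reference for how figures like these can be computed from raw ping samples, here is a minimal sketch using the nearest-rank method. That method is an assumption chosen for simplicity (other percentile definitions interpolate between samples), and the function names are made up. Note that with nearest-rank, the median of an even number of samples is the lower of the two middle values.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) throw new Error("no samples");
    if (p < 0 || p > 100) throw new RangeError("p must be between 0 and 100");
    const sorted = [...samples].sort((a, b) => a - b); // numeric sort, copy
    const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
    return sorted[rank - 1];
}

// The three statistics looked at in this post, for one route and country.
function summarize(samplesMs: number[]): { p20: number; median: number; p95: number } {
    return {
        p20: percentile(samplesMs, 20),
        median: percentile(samplesMs, 50),
        p95: percentile(samplesMs, 95),
    };
}
```

Looking at a low percentile, the median, and a high percentile together shows both the typical latency and how bad the slow tail gets, which a plain average would hide.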
The conclusion I drew from all of this was that it didn’t matter much. All of the times were within a few milliseconds of each other, though both Cloudflare and GCP premium did tend to shave a few milliseconds off the standard network’s latency, particularly when going across the Pacific. Given that going through Cloudflare costs about 1/3 the price of the GCP premium network, and about 1/2 that of the standard network, Cloudflare remains the obvious choice for our WebSocket traffic (as well as all of our other traffic).
Anyways, I guess I was somewhat surprised that they were all so very close, although given the entities involved I suppose I shouldn’t have expected anything different. Still, I’m happy to have conducted the experiment, so we know for sure that using Cloudflare for all of our traffic is a good choice.
P.S. The other thing I attempted to measure was connection drops and reconnect events per route. However, it turns out that naively recording whether a client is reconnecting versus connecting for the first time is useless for this. Every time a laptop is put to sleep, a mobile browser is put into the background, or in some cases a tab is left in the background, the client gets disconnected and then reconnects when the device is woken up or the browser or tab is brought back to the foreground. So the only conclusion I can really draw is that if there was a difference in connection stability between the three routes, it was lost in the noise of reconnection events caused at the device level. Anecdotally though, monitoring my own device (which never sleeps), I did see the standard connection reset a few times, but overall all three were quite stable. I have no reason to believe there’s a substantial difference between Cloudflare and GCP premium in terms of stability, but my hunch is that they both offer a little more stability than the standard network.