Game Analytics with AWS at GDC 2014

I gave a talk at GDC 2014 all about game analytics and AWS. In the talk, I showed how to start small by uploading analytics files from users' devices to S3, and then processing them with Redshift. As your game grows, you can add more data sources and AWS services such as Kinesis and Elastic MapReduce to perform more complex processing. Here are the slides on Slideshare and the videos on YouTube.

Free-to-play has become a ubiquitous strategy for publishing games, especially mobile and social games. Succeeding in free-to-play requires having razor-sharp analytics on your players, so you know what they love and what they hate. Free-to-play aside, having an awesome game has always been about maximizing the love and minimizing the hate. Charge a reasonable price for the things your players love and you have a simple but effective monetization strategy.

At the end of the talk, I blabbed a bit about what I see as the future of gaming: big data and real-time analytics. The more in tune you can get with your players, and the faster you can react, the more your game is going to differentiate itself. Recently there was a massive battle in EVE Online that destroyed nearly $500,000 worth of ships and equipment. Imagine being able to react in real time, in the heat of battle, offering players discounted ammunition targeted at their fleet and status in battle. Some estimate impulse buys to account for 40% of all ecommerce, meaning there is huge untapped potential for gaming in the analytics space.

Real-time Leaderboards with ElastiCache for Redis

With the launch of AWS ElastiCache for Redis this week, I realized my redis-objects gem could use a few more examples. Paste this code into your game’s Ruby backend for real-time leaderboards with Redis. Redis Sorted Sets are the ideal data type for leaderboards: a data structure that guarantees uniqueness of members and keeps them sorted by score in real time. Yep, that’s pretty much exactly what we want. The Redis sorted set commands to populate a leaderboard would be:

ZADD leaderboard 556  "Andy"
ZADD leaderboard 819  "Barry"
ZADD leaderboard 105  "Carl"
ZADD leaderboard 1312 "Derek"

This would create a leaderboard set with members auto-sorted based on their score. To get a leaderboard sorted with highest score as highest ranked, do:

ZREVRANGE leaderboard 0 -1
1) "Derek"
2) "Barry"
3) "Andy"
4) "Carl"

This returns the set’s members sorted in reverse (descending) order. Refer to the Redis docs for ZREVRANGE for more details.

Wasn’t this a Ruby post?

Back to redis-objects. Let’s start with a direct Ruby translation of the above:

require 'redis-objects'
Redis.current = Redis.new(host: 'localhost')

lb = Redis::SortedSet.new('leaderboard')
lb["Andy"]  = 556
lb["Barry"] = 819
lb["Carl"]  = 105
lb["Derek"] = 1312

puts lb.revrange(0, -1)  # ["Derek", "Barry", "Andy", "Carl"]

And… we’re done. Ship it.

Throw that on Rails

Ok, so our game probably has a bit more to it. Let’s assume there’s a User database table, with a score column, created like so:

class CreateUsers < ActiveRecord::Migration
  def up
    create_table :users do |t|
      t.string  :name
      t.integer :score
    end
  end
end

We can integrate a sorted set leaderboard with our User model in two lines:

class User < ActiveRecord::Base
  include Redis::Objects
  sorted_set :leaderboard, global: true
end

Since we’re going to have just a single leaderboard (rather than one per user), we use the global flag. This will create a User.leaderboard sorted set that we can then access anywhere:

puts User.leaderboard.members

(Important: This doesn’t have to be ActiveRecord — you could use Mongoid or DataMapper or Sequel or Dynamoid or any other DB model.)

We’ll add a hook to update our leaderboard when we get a new high score. Since we now have a database table, we’ll index our sorted set by our ID, since it’s guaranteed to be unique:

class User < ActiveRecord::Base
  include Redis::Objects
  sorted_set :leaderboard, global: true

  # after_save fires on both create and update, so brand-new users get ranked too
  after_save :update_leaderboard
  def update_leaderboard
    self.class.leaderboard[id] = score
  end
end

Save a few records:

User.create!(name: "Andy",  score: 556)
User.create!(name: "Barry", score: 819)
User.create!(name: "Carl",  score: 105)
User.create!(name: "Derek", score: 1312)

Fetch the leaderboard:

@user_ids = User.leaderboard.revrange(0, -1)
puts @user_ids  # ["4", "2", "1", "3"] (Redis returns members as strings)

And now we have a Redis leaderboard sorted in real time, auto-updated any time we get a new high score.
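
The leaderboard only hands back IDs, so you still have to load the matching rows and put them back in rank order, since a lookup by a list of IDs generally makes no promise about ordering. Here is a minimal sketch in plain Ruby. The `Player` struct and the `order_by_leaderboard` helper are hypothetical stand-ins (not part of redis-objects), and the ID list is hard-coded where the Redis result would normally go:

```ruby
# Hypothetical stand-in for the User model; real rows would come from the DB.
Player = Struct.new(:id, :name, :score)

# Put records back into leaderboard order.
# ranked_ids: members as returned by Redis (strings), best score first
# records:    rows in arbitrary order
def order_by_leaderboard(ranked_ids, records)
  by_id = records.to_h { |r| [r.id.to_s, r] }
  # compact drops IDs that are in Redis but no longer in the DB
  ranked_ids.map { |id| by_id[id.to_s] }.compact
end

rows = [
  Player.new(1, "Andy", 556),
  Player.new(2, "Barry", 819),
  Player.new(3, "Carl", 105),
  Player.new(4, "Derek", 1312)
]

ranked = order_by_leaderboard(%w[4 2 1 3], rows)
puts ranked.map(&:name).inspect
```

Silently dropping IDs that have no row is one possible choice here; you may prefer to log them, since they hint that Redis and the database have drifted apart.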

But MySQL has ORDER BY

The skeptical reader may wonder why not just sort in MySQL, or whatever the kewl new database flavor of the week is. Outside of offloading our main database, things get more interesting when we want to know our own rank:

class User < ActiveRecord::Base
  # ... other stuff remains ...

  def my_rank
    self.class.leaderboard.revrank(id) + 1
  end
end

Then:

@user = User.find(1) # Andy
puts @user.my_rank   # 3

Getting a numeric rank for a row in MySQL would require adding a new “rank” column, and then running a job that re-ranks the entire table. Doing this in real time means clobbering MySQL with a global re-rank every time anyone’s score changes. This makes MySQL unhappy, especially with lots of users.
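
To make the rank arithmetic concrete, here is a plain-Ruby model of what the sorted set is answering for us. It's an in-memory sketch, not Redis and not redis-objects: `revrank_of` and `rank_of` are hypothetical helpers, and the model assumes no tied scores. It also returns nil for a player who isn't on the board yet, a case the `my_rank` method above doesn't guard against.

```ruby
# Member ID => score, mirroring the four users created earlier.
scores = { "1" => 556, "2" => 819, "3" => 105, "4" => 1312 }

# Zero-based position when sorted by score, highest first.
# With no ties, that is simply the number of players who beat you.
def revrank_of(scores, member)
  return nil unless scores.key?(member)
  scores.count { |_, s| s > scores[member] }
end

# Human-friendly rank: 1 is the top spot, nil means "not ranked yet".
def rank_of(scores, member)
  pos = revrank_of(scores, member)
  pos && pos + 1
end

puts rank_of(scores, "1").inspect   # Andy
puts rank_of(scores, "99").inspect  # unknown player
```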

Kids are calling so that’s all for now. Enjoy!

Atomic Rant Redux

My atomic rant has gotten a ton of traffic – more than I foresaw. Seems atomicity is a hot topic in the web world these days. Increasing user concurrency, coupled with more interactive apps, exposes all sorts of edge cases. I wanted to write a follow-up post to step back and look at a few more high-level concerns with atomicity, as well as some Redis-specific issues we’ve seen.

Know Your Actors

In my original rant, I used the example of students enrolling in online classes to illustrate why atomicity was crucial to operations with multiple actors. And speaking of actors, they’re an even better target analogy. You need to assume your actors are all going to try to jam through the audition door at the same time. What happens if they are all talking to the director at once? How many conversations can continue in parallel? If you’re careful, you can get away with one final gate at the end, which makes your life infinitely easier. That is, funnel everyone to a decision point, congratulate one person, then tell the others sorry.

Of course, if that funnel is too long, you’re going to piss off your users in a major way. If you’ve ever bought tickets from Ticketmaster, you’re familiar with this problem. Granted they’ve gotten much better over the years (which is saying something…), and this is partially due to embracing the Amazon guesses and apologies approach. If you have 200 tickets left, a person can probably get one. But if you have 10 tickets left, they’re probably going to get screwed. If you can help with the user’s expectations (“less than 10 tickets left!”) then people are more likely to be forgiving.

In the world of online games, this translates to showing players the number of slots left in a game, but then handling the situation where there were 2 slots left but you were the third person to hit “Submit”. You always need to handle these errors, because there’s no way to completely eliminate race conditions in a networked application.
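
Here is a rough sketch of that "one final gate" idea, using a Ruby Mutex inside a single process as a stand-in for whatever atomic gate you really use (Redis, a DB constraint, etc.). The `GameLobby` class and its method names are made up for illustration. The point is that the slot check and the slot claim happen as one step, and the loser gets a clean error you can turn into an apology.

```ruby
class GameLobby
  class Full < StandardError; end

  def initialize(slots)
    @slots = slots
    @players = []
    @gate = Mutex.new
  end

  # Only a hint for the UI; it can be stale by the time the user clicks.
  def slots_left
    @gate.synchronize { @slots - @players.size }
  end

  # Check and claim in one atomic step.
  def join(player)
    @gate.synchronize do
      raise Full, "Sorry, that game just filled up" if @players.size >= @slots
      @players << player
    end
  end
end

lobby = GameLobby.new(2)

# Three players hit "Submit" at once for two slots.
threads = %w[ann bob cat].map do |name|
  Thread.new do
    begin
      lobby.join(name)
      :joined
    rescue GameLobby::Full
      :sorry
    end
  end
end
results = threads.map(&:value)

puts results.tally.inspect
puts lobby.slots_left
```

Which two players get in depends on thread scheduling, but exactly two always do.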

Recovering from Hiccups

Sooner or later, your slick, smooth-running atomic system is going to have problems. Even if it’s well-engineered, you could have a large outage such as a system crash, datacenter failure, etc. Plan on it.

Using Redis to offload atomic ops from the DB yielded big performance benefits, but added fragility. You now have two systems that must stay in sync. If either one crashes, there’s the possibility that you’re going to have dangling locks for records that are ok, or vice-versa. So you need a way to clear them. In a perfect world with infinite time, you’d be able to engineer a self-detecting, self-repairing system that can auto-recover. Good luck with that. A cron job that deletes locks older than a certain time works pretty well for the rest of us.
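
The cron-job approach can be sketched in a few lines. This version works on a plain Hash of lock name to acquisition time so it runs anywhere; the lock names, the 60-second cutoff, and the `sweep_stale_locks` helper are all assumptions for illustration. A real job would read and delete the locks in Redis instead.

```ruby
# locks: Hash of lock name => Time the lock was acquired
# Deletes every lock older than max_age_seconds and returns the removed names.
def sweep_stale_locks(locks, max_age_seconds, now = Time.now)
  stale = locks.select { |_, acquired_at| now - acquired_at > max_age_seconds }
  stale.each_key { |name| locks.delete(name) }
  stale.keys
end

now = Time.at(1_000_000)
locks = {
  "lock:game:1" => now - 5,    # fresh, probably still in use
  "lock:game:2" => now - 600   # ten minutes old, almost certainly dangling
}

removed = sweep_stale_locks(locks, 60, now)
puts removed.inspect
puts locks.keys.inspect
```

Pick the cutoff comfortably longer than your slowest legitimate transaction, or the sweeper becomes a source of races itself.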

It’s also a good idea to have a script you can run manually, in the event you know you need to reset certain things. For example, to handle the case where you know your Redis node went down, you could have a script that deletes all locks where the ID is greater than the current max ID in the DB. Oracle and other systems have similar concepts built into their native locking procedures.
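
A manual reset script along those lines might start by picking out the orphaned locks. In this sketch the `lock:<id>` naming scheme and the `orphaned_locks` helper are assumptions; substitute however your locks are actually keyed, and feed in the real max ID from your database.

```ruby
# Returns the locks whose record ID is beyond the highest ID in the DB,
# i.e. locks for records that were never actually saved.
def orphaned_locks(lock_names, max_db_id)
  lock_names.select do |name|
    id = name[/\Alock:(\d+)\z/, 1]  # nil if the name doesn't match the scheme
    id && id.to_i > max_db_id
  end
end

locks = %w[lock:7 lock:12 lock:15 lock:16]
puts orphaned_locks(locks, 12).inspect
```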

Troubleshooting Redis is a Pain

Unfortunately, Redis is lacking in the way of tools because it is still young. There is the PHP Redis Admin app, but its development appears to have stalled. Beyond that it’s pretty much roll-your-own-scripts at this point. We’ve thought about developing a general-purpose Redis app/tool ourselves, but with the Redis 2.0 changes and VMware hiring Salvatore, the tools side is a bit “wait and see”.

So before you start throwing all of your critical data into Redis, realize it’s a bit black-box at this point (or at least, a really dark gray). I’m not a GUI guy personally – I prefer command-line tools due to my sysadmin days – but for many programmers, GUI tools help debugging a lot. You need to make sure your programmers working with Redis can debug it when you have problems, which means a bigger investment in scripts vs. just downloading MySQL Workbench or Oracle SQL Developer.

Check and Double-Check

The last thing worth mentioning is this: Don’t trust your own app. Even if you have an atomic gate at the start of a transaction, do sanity checking at the end too. There are a few reasons for this:

  • The lock may have expired for some reason, and you didn’t test for this
  • Your locking server may have crashed when you’re in the middle of a transaction
  • There could be a background job overlapping with a front-end transaction
  • Your software may have bugs (improbable, I know)

For example, we had a background job that was using the same lock as a front-end service. This ended up being a design mistake, but it was difficult to track down because it happened very infrequently. The only way we found it was we had assertions that would get hit periodically on supposedly impossible conditions. Once we correlated the times with the background job running, we were able to fix the issue rather quickly.
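
Here is one way the "check at the end too" idea could look. It's a sketch with an in-memory `Lock` struct and a hypothetical `with_checked_lock` helper, not our production code: the lock is verified once at the gate and again after the work is done, and the caller gets a `LockLost` error to recover from if ownership changed or the lock expired in between.

```ruby
class LockLost < StandardError; end

# Hypothetical in-memory lock: who owns it and when it expires.
Lock = Struct.new(:owner, :expires_at) do
  def held_by?(who, now)
    owner == who && now < expires_at
  end
end

def with_checked_lock(lock, who, clock: -> { Time.now })
  # Gate at the start of the transaction.
  raise LockLost, "gate closed" unless lock.held_by?(who, clock.call)
  result = yield
  # Sanity check at the end: do we still own the lock now that the work is done?
  raise LockLost, "lock expired or was taken mid-transaction" unless lock.held_by?(who, clock.call)
  result
end

t0 = Time.at(0)
lock = Lock.new("web-1", t0 + 30)

# Happy path: nothing interferes.
ok = with_checked_lock(lock, "web-1", clock: -> { t0 + 1 }) { :saved }

# A background job grabs the same lock while we're working.
outcome = begin
  with_checked_lock(lock, "web-1", clock: -> { t0 + 1 }) do
    lock.owner = "background-job"
    :saved
  end
rescue LockLost
  :rolled_back
end

puts ok.inspect
puts outcome.inspect
```

In a real app the rescue branch is where you would roll back, apologize to the user, and log enough detail to correlate the failure with whatever else was running.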

So my opinion is this: Try to do the right thing, but if it screws up, apologize to the user, recover, and move on.