Real-time Leaderboards with ElastiCache for Redis

With the launch of Amazon ElastiCache for Redis this week, I realized my redis-objects gem could use a few more examples. Paste this code into your game’s Ruby backend and you’ll have real-time leaderboards backed by Redis. Redis Sorted Sets are the ideal data type for leaderboards: a sorted set guarantees uniqueness of its members and keeps them sorted by score in real time. Yep, that’s pretty much exactly what we want. The Redis sorted set commands to populate a leaderboard would be:

ZADD leaderboard 556  "Andy"
ZADD leaderboard 819  "Barry"
ZADD leaderboard 105  "Carl"
ZADD leaderboard 1312 "Derek"

This would create a leaderboard set with members auto-sorted based on their score. To get a leaderboard sorted with highest score as highest ranked, do:

ZREVRANGE leaderboard 0 -1
1) "Derek"
2) "Barry"
3) "Andy"
4) "Carl"

This returns the set’s members sorted in reverse (descending) order. Refer to the Redis docs for ZREVRANGE for more details.

Wasn’t this a Ruby post?

Back to redis-objects. Let’s start with a direct Ruby translation of the above:

require 'redis-objects'
Redis.current = Redis.new(host: 'localhost')

lb = Redis::SortedSet.new('leaderboard')
lb["Andy"]  = 556
lb["Barry"] = 819
lb["Carl"]  = 105
lb["Derek"] = 1312

puts lb.revrange(0, -1)  # ["Derek", "Barry", "Andy", "Carl"]

And… we’re done. Ship it.
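Ok, maybe one or two more lines before we actually ship it. The gem also has per-member readers; here’s a quick sketch (double-check these method names against the redis-objects README for your gem version):

lb.revrank("Derek")  # 0, highest score sits at the top
lb.revrank("Carl")   # 3, lowest score sits at the bottom
lb.score("Andy")     # 556 (possibly as a Float, depending on the gem version)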

Throw that on Rails

Ok, so our game probably has a bit more to it. Let’s assume there’s a User database table with a score column, created like so:

class CreateUsers < ActiveRecord::Migration
  def up
    create_table :users do |t|
      t.string  :name
      t.integer :score
    end
  end
end

We can integrate a sorted set leaderboard with our User model in two lines:

class User < ActiveRecord::Base
  include Redis::Objects
  sorted_set :leaderboard, global: true
end

Since we’re going to have just a single leaderboard (rather than one per user), we use the global flag. This will create a User.leaderboard sorted set that we can then access anywhere:

puts User.leaderboard.members

(Important: This doesn’t have to be ActiveRecord — you could use Mongoid or DataMapper or Sequel or Dynamoid or any other DB model.)

We’ll add a hook to update the leaderboard whenever a user’s score is saved. Since we now have a database table, we’ll index our sorted set by ID, since it’s guaranteed to be unique:

class User < ActiveRecord::Base
  include Redis::Objects
  sorted_set :leaderboard, global: true

  after_save :update_leaderboard
  def update_leaderboard
    self.class.leaderboard[id] = score
  end
end

Save a few records:

User.create!(name: "Andy",  score: 556)
User.create!(name: "Barry", score: 819)
User.create!(name: "Carl",  score: 105)
User.create!(name: "Derek", score: 1312)

Fetch the leaderboard:

@user_ids = User.leaderboard.revrange(0, -1)
puts @user_ids  # ["4", "2", "1", "3"] (Redis hands the IDs back as strings)
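Since those are string IDs rather than User objects, to show names we map them back to User rows while keeping the Redis ordering. A quick sketch using plain ActiveRecord plus ActiveSupport’s index_by:

users_by_id = User.where(id: @user_ids).index_by(&:id)
@leaders = @user_ids.map { |id| users_by_id[id.to_i] }
puts @leaders.map(&:name)  # ["Derek", "Barry", "Andy", "Carl"]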

And now we have a Redis leaderboard sorted in real time, auto-updated any time we get a new high score.

But MySQL has ORDER BY

The skeptical reader may wonder why not just sort in MySQL, or whatever the kewl new database flavor of the week is. Outside of offloading our main database, things get more interesting when we want to know our own rank:

class User < ActiveRecord::Base
  # ... other stuff remains ...

  def my_rank
    self.class.leaderboard.revrank(id) + 1
  end
end

Then:

@user = User.find(1) # Andy
puts @user.my_rank   # 3
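The same revrank trick gives you the classic “players near me” view. A small sketch, where nearby_ids is my own helper and not part of the gem:

class User < ActiveRecord::Base
  # ... other stuff remains ...

  def nearby_ids(window = 2)
    rank = self.class.leaderboard.revrank(id)
    return [] unless rank  # we're not on the board yet
    self.class.leaderboard.revrange([rank - window, 0].max, rank + window)
  end
end

Redis happily clips the range at the end of the set, so no bounds checking is needed on the high side.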

Getting a numeric rank for a row in MySQL would require adding a new “rank” column, and then running a job that re-ranks the entire table. Doing this in real time means clobbering MySQL with a global re-rank every time anyone’s score changes. This makes MySQL unhappy, especially with lots of users.

Kids are calling so that’s all for now. Enjoy!

Atomic Rant Redux

My atomic rant has gotten a ton of traffic – more than I foresaw.  Seems atomicity is a hot topic in the web world these days. Increasing user concurrency, coupled with more interactive apps, exposes all sorts of edge cases. I wanted to write a follow-up post to step back and look at a few more high-level concerns with atomicity, as well as some Redis-specific issues we’ve seen.

Know Your Actors

In my original rant, I used the example of students enrolling in online classes to illustrate why atomicity is crucial to operations with multiple actors. And speaking of actors, they make an even better analogy. You need to assume your actors are all going to try to jam through the audition door at the same time. What happens if they’re all talking to the director at once? How many conversations can continue in parallel? If you’re careful, you can get away with one final gate at the end, which makes your life infinitely easier. That is, funnel everyone to a decision point, congratulate one person, then tell the others sorry.

Of course, if that funnel is too long, you’re going to piss off your users in a major way. If you’ve ever bought tickets from Ticketmaster, you’re familiar with this problem. Granted, they’ve gotten much better over the years (which is saying something…), partially by embracing the Amazon-style guess-and-apologize approach. If you have 200 tickets left, a person can probably get one. But if you have 10 tickets left, they’re probably going to get screwed. If you help set the user’s expectations (“less than 10 tickets left!”), people are more likely to be forgiving.

In the world of online games, this translates to showing players the number of slots left in a game, but then handling the situation where there were 2 slots left and you were the third person to hit “Submit”. You always need to handle these errors, because there’s no way to completely eliminate race conditions in a networked application.

Recovering from Hiccups

Sooner or later, your slick, smooth-running atomic system is going to have problems. Even if it’s well-engineered, you could have a large outage such as a system crash, datacenter failure, etc. Plan on it.

Using Redis to offload atomic ops from the DB yielded big performance benefits, but added fragility. You now have two systems that must stay in sync. If either one crashes, there’s the possibility that you’re going to have dangling locks for records that are ok, or vice-versa. So you need a way to clear them. In a perfect world with infinite time, you’d be able to engineer a self-detecting, self-repairing system that can auto-recover. Good luck with that. A cron job that deletes locks older than a certain time works pretty well for the rest of us.

It’s also a good idea to have a script you can run manually, in the event you know you need to reset certain things. For example, to handle the case where you know your Redis node went down, you could have a script that deletes all locks whose ID is greater than the current max ID in the DB. Oracle and other systems have similar concepts built into their native locking procedures.
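Either way, the cleanup script doesn’t need to be fancy. A minimal sketch, assuming your locks live under a lock:* key prefix and store the epoch time they were taken (one possible convention, not something Redis or redis-objects dictates):

require 'redis'

redis = Redis.new(host: 'localhost')
max_age = 15 * 60  # locks older than 15 minutes are presumed dangling

redis.keys('lock:*').each do |key|
  locked_at = redis.get(key).to_i  # assumes the lock value is Time.now.to_i at lock time
  redis.del(key) if locked_at > 0 && Time.now.to_i - locked_at > max_age
end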

Troubleshooting Redis is a Pain

Unfortunately, Redis is still young, so it’s lacking in the way of tools. There is the PHP Redis Admin app, but its development appears to have stalled. Beyond that, it’s pretty much roll-your-own-scripts at this point. We’ve thought about developing a general-purpose Redis app/tool ourselves, but with the Redis 2.0 changes and VMware hiring Salvatore, the tools side is a bit “wait and see”.

So before you start throwing all of your critical data into Redis, realize it’s a bit of a black box at this point (or at least, a really dark gray). I’m not a GUI guy personally – I prefer command-line tools, a holdover from my sysadmin days – but for many programmers, GUI tools help a lot with debugging. You need to make sure the programmers working with Redis can debug it when you have problems, which means a bigger investment in scripts vs. just downloading MySQL Workbench or Oracle SQL Developer.

Check and Double-Check

The last thing worth mentioning is this: Don’t trust your own app. Even if you have an atomic gate at the start of a transaction, do sanity checking at the end too. There are a few reasons for this:

  • The lock may have expired for some reason, and you didn’t test for this
  • Your locking server may have crashed when you’re in the middle of a transaction
  • There could be a background job overlapping with a front-end transaction
  • Your software may have bugs (improbable, I know)

For example, we had a background job that was using the same lock as a front-end service. This ended up being a design mistake, but it was difficult to track down because it happened very infrequently. The only way we found it was through assertions that would periodically get hit on supposedly impossible conditions. Once we correlated the times with the background job running, we were able to fix the issue rather quickly.
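In code, that end-of-transaction double-check can be as dumb as re-verifying the invariant after the work is done and compensating if it doesn’t hold. A rough sketch against the course-enrollment example from the original rant (the capacity check and cleanup here are illustrative, not a drop-in):

enrollment = @course.course_students.create!(:student_id => 101)

# Belt and suspenders: the counter gate should have prevented this,
# but if we somehow blew past capacity, undo the work, apologize, move on.
if @course.course_students.count > @course.max_students
  enrollment.destroy
  @course.slots_taken.decrement
end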

So my opinion is this: Try to do the right thing, but if it screws up, apologize to the user, recover, and move on.

An Atomic Rant

You are probably not handling atomic operations properly in your app, and probably have some nasty lurking race conditions. The worst part is these will get worse as your user count increases, are difficult to reproduce, and usually happen in your most critical pieces of code. (And no, your unit tests can’t catch them either.)

Spoiler: If you’re part of the ADHD generation and want to skip learning and go straight to the punchline, use Redis and redis-objects for all your atomic data needs.

Brush Up Your Resume

Let’s assume you’re writing an app to enable students to enroll in courses. You need to ensure that no more than 30 students can sign up for a given course. In your enrollment code, you have something like this:

@course = Course.find(1)
if @course.num_students < 30
  @course.course_students.create!(:student_id => 101)
  @course.num_students += 1
  @course.save!
else
  # course is full
end

You’re screwed. You now have 32 people in your 30-person class, and you have no idea what happened. Here’s why: two requests can each load the course, each see num_students at 29, and each happily create an enrollment before either one saves the updated count.

“Well no duh,” you’re saying, “even the ActiveRecord docs mention locking, so I’ll just use that.”

@course = Course.find(1, :lock => true)
if @course.num_students < 30
  # ...

Nice try, but now you’ve introduced other issues. Any other piece of code in your entire app that needs to update anything about the course – maybe the course name, or start date, or location – is now serialized. If you need high concurrency, you’re screwed (still).

You think, “ah-ha, the problem is having a separate counter!”

@course = Course.find(1)
if @course.course_students.count < 30
  @course.course_students.create!(:student_id => 101)
else
  # course is full
end

Nope. Still screwed.

The Root Down

It’s worth understanding the root issue, and how to address it.

Race conditions arise from the difference in time between evaluating and altering a value. In our example, we fetched the record, then checked the value, then changed it. The more lines of code between those operations, and the higher your user count, the bigger the window of opportunity for other clients to get the data in an inconsistent state.

Sometimes race conditions don’t matter in practice, since often a user is only operating on their own data. This has a race condition, but is probably ok:

@user = User.find(params[:id])
@post = Post.create(:user_id => @user.id, :title => "Whattup")
@user.total_posts += 1  # update my post count

But this would be problematic:

@blog = Blog.find(params[:id])
@post = Post.create(:blog_id => @blog.id, :title => "Whattup")
@blog.total_posts += 1  # update post count across all users

As multiple users could be adding posts concurrently.

In a traditional RDBMS, you can increment counters atomically (but not return them) by firing off an update statement that self-references the column:

update users set total_posts = total_posts + 1 where id = 372

You may have seen ActiveRecord’s increment_counter class method, which wraps this functionality. This solves half the problem: the counter gets updated atomically. But it doesn’t return the new value, and your in-memory object is now out of sync with the DB, so you get other issues:

@blog = Blog.find(params[:id])
Blog.increment_counter :total_posts, @blog.id
if @blog.total_posts == 1000
  # the 1000th poster - award them a gold star!

The DB says 1000, but your @blog object still says 999, and the right person doesn’t get their gold star. Sad faces all around.

A Better Way

Bottom line: Any operation that can alter a value must also return that value in the same operation for it to be atomic. If you do a separate get then set, or set then get, you’re open to a race condition. There are only a few systems that support an “increment and return” type operation, and Redis is one of them (Oracle sequences are another, and Postgres supports “update returning”).
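In Redis terms that operation is INCR, which bumps the value and hands it back in one shot. Wiring it straight into the gold-star example with plain redis-rb (the key name here is just an example, not a convention the gem imposes):

redis = Redis.new

@blog = Blog.find(params[:id])
new_total = redis.incr("blog:#{@blog.id}:total_posts")  # increment AND return, atomically
if new_total == 1000
  # the 1000th poster - award them a gold star!
end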

When you think of the specific things that you need to ensure, many of these will reduce to numeric operations:

  • Ensuring there are no more than 30 students in a course
  • Getting more than 2 but less than 6 people in a game
  • Keeping a chat room to a max of 50 people
  • Correctly recording the total number of blog posts
  • Only allowing one piece of code to reorder a large dataset at a time

All except the last one can be implemented with counters. The last one will need a carefully placed lock.
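For that last one, redis-objects also bundles a Redis-backed lock helper. From memory, the usage looks roughly like this; treat the macro name and its options as something to confirm against the gem’s README:

class Dataset < ActiveRecord::Base
  include Redis::Objects
  lock :reorder  # should define a reorder_lock handle backed by Redis
end

@dataset.reorder_lock.lock do
  # only one process at a time gets in here;
  # do_the_big_reorder stands in for your own method
  @dataset.do_the_big_reorder
end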

The best way I’ve found to balance atomicity and concurrency is, for each value, actually create two counters:

  • A counter you base logic on (eg, slots_taken)
  • A counter users see (eg, current_students)

The reason you want two counters is that you’ll need to change the value of the logic counter first, before checking it, to address any race conditions. This means the value can get wonky momentarily (eg, there could be 32 slots_taken for a 30-person course). This doesn’t affect its function – indeed, it’s part of what makes it work – but it does mean you don’t want to display it.

So, taking our Course example:

class Course < ActiveRecord::Base
  include Redis::Objects

  counter :slots_taken
  counter :current_students
end

Then:

@course = Course.find(1)
@course.slots_taken.increment do |val|
  if val <= @course.max_students
    @course.course_students.create!(:student_id => 101)
    @course.current_students.increment
  end
end

Race-condition free. Why? Because we’re checking the direct result of the increment operation against a value. The set and get operations are one and the same, which is the crucial piece. If that code block returns false, the counter is rewound, and no animals were harmed in this atomic op.
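One practical addendum: you usually want to tell the user when they didn’t get the slot. A small sketch that sets a plain local flag inside the block, so we don’t have to rely on the return value of increment (the flash bits are stand-in Rails controller code):

got_in = false
@course.slots_taken.increment do |val|
  if val <= @course.max_students
    @course.course_students.create!(:student_id => 101)
    @course.current_students.increment
    got_in = true
  end
end

if got_in
  flash[:notice] = "You're in!"
else
  flash[:error] = "Sorry, that course just filled up."
end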

Then, due to the current_students counter, your views get consistent information about the course, since it will only be incremented on success. There is still a race condition where current_students could be less than the real number of CourseStudent records, but since you’ll be displaying these values in a view (after that block completes) you shouldn’t see this manifest in real-world usage.

Now you can sleep soundly, without fear of getting fired at 3am via an angry phone call from your boss. (At least, not about this…)