Spoiler: If you’re part of the ADHD generation and want to skip learning and go straight to the punchline, use Redis and redis-objects for all your atomic data needs.
Brush Up Your Resume
You are probably not handling atomic operations properly in your app, and probably have some nasty lurking race conditions. The worst part is these will get worse as your user count increases, are difficult to reproduce, and usually happen in your most critical pieces of code. (And no, your rspec tests can’t catch them either.)
Let’s assume you’re writing an app to enable students to enroll in courses. You need to ensure that no more than 30 students can sign up for a given course. In your enrollment code, you have something like this:
@course = Course.find(1)
if @course.num_students < 30
@course.course_students.create!(:student_id => 101)
@course.num_students += 1
@course.save!
else
# course is full
end
You’re screwed. You now have 32 people in your 30 person class, and you have no idea what happened.
“Well no duh,” you’re saying, “even the ActiveRecord docs mention locking, so I’ll just use that.”
@course = Course.find(1, :lock => true)
if @course.num_students < 30
# ...
Nice try, but now you’ve introduced other issues. Any other piece of code in your entire app that needs to update anything about the course – maybe the course name, or start date, or location – is now serialized. If you need high concurrency, you’re screwed (still).
You think, “ah-ha, the problem is having a separate counter!”
@course = Course.find(1)
if @course.course_students.count < 30
@course.course_students.create!(:student_id => 101)
else
# course is full
end
Nope. Still screwed.
The Root Down
It’s worth understanding the root issue, and how to address it.
Race conditions arise from the difference in time between evaluating and altering a value. In our example, we fetched the record, then checked the value, then changed it. The more lines of code between those operations, and the higher your user count, the bigger the window of opportunity for other clients to get the data in an inconsistent state.
Sometimes race conditions don’t matter in practice, since often a user is only operating on their own data. This has a race condition, but is probably ok:
@user = User.find(params[:id])
@post = Post.create(:user_id => @user.id, :title => "Whattup")
@user.total_posts += 1 # update my post count
But this would be problematic:
@blog = Blog.find(params[:id])
@post = Post.create(:blog_id => @blog.id, :title => "Whattup")
@blog.total_posts += 1 # update post count across all users
As multiple users could be adding posts concurrently.
In a traditional RDBMS, you can increment counters atomically (but NOT return them) by firing off an update statement that self-references the column:
update users set total_posts = total_posts + 1 where id = 372
You may have seen ActiveRecord’s increment_counter class method, which wraps this functionality. This solves half the problem of updating the counters atomically. But this has the serious the side effect that your object is no longer in sync with the DB, so you get other issues:
@blog = Blog.find(params[:id])
Blog.increment_counter :total_posts, @blog.id
if @blog.total_posts == 1000
# the 1000th poster - award them a gold star!
The DB says 1000, but your @blog object still says 999, and the right person doesn’t get their gold star. Sad faces all around.
A Better Way
Bottom line: Any operation that can alter a value must return that value in the same operation for it to be atomic. If you do a separate get then set, or set then get, you’re open to a race condition. There are very few systems that support an “increment and return” type operation, and Redis is one of them (Oracle sequences are another).
When you think of the specific things that you need to ensure, many of these will reduce to numeric operations:
- Ensuring there are no more than 30 students in a course
- Getting more than 2 but less than 6 people in a game
- Keeping a chat room to a max of 50 people
- Correctly recording the total number of blog posts
- Only allowing one piece of code to reorder a large dataset at a time
All except the last one can be implemented with counters. The last one will need a carefully placed lock.
The best way I’ve found to balance atomicity and concurrency is, for each value, actually create two counters:
- A counter you base logic on (eg, slots_taken)
- A counter users see (eg, current_students)
The reason you want two counters is you’ll need to change the value of the logic counter first, before checking it, to address any race conditions. This means the value can get wonky momentarily (eg, there could be 32 slots_taken for a 30-person course). This doesn’t affect its function – indeed, it’s part of what makes it work – but does mean you don’t want to display it.
So, taking our Course example:
class Course < ActiveRecord::Base
include Redis::Objects
counter :slots_taken
counter :current_students
end
Then:
@course = Course.find(1)
@course.slots_taken.increment do |val|
if val <= @course.max_students
@course.course_students.create!(:student_id => 101)
@course.current_students.increment
end
end
Race-condition free. Why? Because we’re checking the direct result of the increment operation against a value. The set and get operations are one and the same, which is the crucial piece. If that code block returns false, the counter is rewound, and no animals were harmed in this atomic op.
Then, due to the current_students counter, your views get consistent information about the course, since it will only be incremented on success. There is still a race condition where current_students could be less than the real number of CourseStudent records, but since you’ll be displaying these values in a view (after that block completes) you shouldn’t see this manifest in real-world usage.
Now you can sleep soundly, without fear of getting fired at 3am via an angry phone call from your boss. (At least, not about this…)
Berkes
July 25, 2010
I don’t know any popular CMS that has proper measures in place for race conditions. It is but a matter of time untill exploits are engineered that abuse race conditions to either break an app, or even break into it.
Happy to see some attention in the framework world for this. Thanks for the writeup and well thought out solution.
dstroma
November 7, 2010
This is what row-level locking was meant for. Although you cite performance concerns, the vast majority of applications do not need high row-level write concurrency. In your example, the amount of time it takes to obtain the lock, check the students counter, and commit the transaction is negligible. I don’t see why you should reinvent the wheel.
I don’t think transactions getting serialized due to locks is a bad thing. Having multiple people update the same row at once is exactly what I do NOT want.
Nate Wiger
November 19, 2010
Well, it’s not like I had never heard of row-level locking. Exceeding the DB’s capacity for handling row-level locks is exactly what led to this solution. DB locks were running a massive machine up to 100% CPU. It may be true that many apps don’t need a separate atomicity solution, but when you’re running at large scale, you do.
dstroma
April 4, 2011
Waiting for a lock does not use any CPU.
Nate Wiger
April 4, 2011
Not true.
http://bugs.mysql.com/bug.php?id=49047
And http://www.mysickql.com/?p=78
dstroma
April 4, 2011
Interesting, but the fact that you needed to cite a bug report speaks for itself. The blog post did not mention you were using MySQL, and the fact that this whole exercise was a work-around for a bug was not mentioned either.
Nate Wiger
April 4, 2011
I hear your point, but it’s not just a workaround for a bug. I liked the bug report because it explains the real-world constraints of architecture decisions in detail.
From http://en.wikipedia.org/wiki/Lock_(computer_science)#Types
I’m not sure where you got the absolute statement “waiting for a lock does not use any CPU”. This sounds like something a textbook or CS professor would say. In UN*X, the “wait” state technically does not use CPU, but that’s not the same as saying creating, maintaining, and destroying locks has no overhead. It’s just not true.
Zack Gramana
April 19, 2011
Excellent post. Clearly, not everyone has worked on applications that challenge typical concurrency techniques.