The job title Director sucks

I loathe the term “Director”, because it connotes the opposite behavior of what senior leaders should really be doing.

A movie director tells every person what to do – where to stand, how to act, which cameras should shoot from what angles, which lighting to use, and so forth.  This works great for the movies, or theater, or dance, where you have a predefined script that you can start and stop at any moment.


This is a horrible analogy for fast-moving companies.  First, you can’t start and stop the action at any time.  Second, it’s not possible to have your arms around an entire product being developed. Even movie directors only shoot one scene at a time.

The other connotation of director is somebody who sets the overall direction. This is somewhat true, but at a fast-moving company with talented people, does one person really have all the good ideas? No.


Facilitator is a far better word: a person who makes an action or process easier. Facilitators help people overcome obstacles, resolve conflicts, and grow. They don’t tell you what to do. They help you figure out what to do for yourself.

Now, I’m not saying Indiana Jones and the Temple of Doom was a bad movie, or that Spielberg is a bad director. It’s that the paradigm of “directing” doesn’t work for fast-moving companies. And the problem with using the word “director” is that every time you hear it, it connotes the same mismatched behaviors. If you’re the Director, nothing can happen without you. Instant bottleneck.

Instead, you need to empower your people and unblock them, but otherwise get out of their way.  There’s a great book called The Coaching Habit that offers fantastic practical advice on how to listen to people and coach them, so that they take the initiative to solve problems themselves.

Try This

Try changing your title and seeing what happens.  Odds are you can’t change your official HR title, but you may have an internal phone tool, Exchange directory, or business cards.  Try changing your title on them to “Facilitator”. See how it affects your mindset and the mindset of others!

If you want to hone your skills, check out this blog on the 9 characteristics of a good facilitator.


Nate’s Stock Market Theory of Management

I’m a fan of simplifying as much as possible. One strategy I love to use is analogies.

I’m sure you’ve heard the phrase “it’s like riding a bicycle”. We all know what that means – once you’ve learned a skill, you can take a break from it, but regain it quickly if you start again later.  Saying “it’s like riding a bicycle” encompasses not only this concept, but brings along a richness of emotion. Those summer days as a kid, riding your bike to the pool to meet up with some friends, maybe to grab an ice cream afterwards.

Managing teams is like stock market investing

Life at a fast-moving company is full of swings, highs, and lows.  In software, there may be launches, bugs, or service outages that cause different individuals in the organization to go back and forth rapidly. In operations, there can be holiday sales, labor strikes, or equipment issues that cause huge variations in day-to-day work. This churn is often visible, through email escalations, phone alerts, or literal flashing red lights.

Managers often fail in one of their most important responsibilities: providing stability for their teams.  As a manager, you are guiding your teams, helping them release products and triage issues. But, you’re not sitting side-by-side with every engineer, experiencing every bug fix with them.  Your job is to smooth out bumps and valleys, and keep the team together as a unit. In times of crisis, you are there to calm them.  In times of change, you are there to guide them through.

You are a smoothing function, like a moving average in a stock market graph.

moving average mgmt

The amount of smoothing you do is dependent on your role.  As a front-line manager (the red trend line), you need to respond to day-to-day events that impact your team. But as a more senior manager, you should not respond too quickly.  Overreacting to daily events leads to “knee-jerk reactions” or “seagull management“.

As you move into more senior management roles, you take on a broader perspective, and a longer-term view.  Rather than managing one team and thinking daily or weekly, you are managing multiple teams and thinking monthly or quarterly (the green trend line).  At the executive level, you are looking out 6-12 months and creating multi-year plans (the blue trend line).  Your job is to provide a stable vision for the team, a North Star to navigate towards.  In stock market terms, you are a daily, then a 50-day, then a 200-day moving average for your team.

Keeping in Sync

The stock market moving average analogy can be taken further.  You’ll notice in the graph I chose that each of the management layers is somewhat out-of-sync.  It gets particularly pronounced in the middle section, where lines cross and move in opposite directions.  In our analogy, this could represent a change in strategy, or an internal reorganization. Eventually, leadership realigns, and the team can move forward.

moving average mgmt change

Notice that while leadership is not aligned, the team vacillates back and forth. When teams report feeling “churn”, this is what they are feeling.

The longer time period a moving average reflects, the more it can be out-of-sync with daily events. In the above graph, the Director and VP/SVP levels are stable on the down and up swings. But their version of reality is not entirely in sync with what is happening day-to-day. This is a common challenge for senior leadership.

This is where empowerment comes in.  I would argue the above graph is healthy, if the front-line manager (the red trend line) is empowered appropriately. The long-term trend lines focus on the long-term, and the short-term trend lines focus on the short-term.

Having spent almost 7 years at Amazon, this is what Amazon does really well.  Every “two-pizza team” owns their own destiny – the tools they use, the coding methods they follow, the internal systems they reuse (or don’t), the scrum discipline they adopt, and so forth.

I witnessed numerous Amazon new hires experience extreme culture shock. Over the years I heard people comment that it seemed like “general anarchy”, “barely controlled chaos”, or even simply, “I can’t believe a company can operate this way”.  But consider the reverse. Think if every little decision about every feature, or toolset, or architecture choice had to go up to the VP layer (the blue trend line). The entire company would grind to a halt. Instead, it’s a rocket ship.

Try This

First, figure out what smoothing line you are supposed to be.  Are you a front-line manager? Director? SVP? Make sure you are acting appropriately.

Second, are you empowering your people? Empowerment is a big one that pays off. If you challenge people with a stretch goal and tell them “I believe in you”, they can do amazing things. (Shocking, I know.)

Finally, if you have any great stock tips, let me know.

Donut based security at Amazon

This is not a clever technical article where DONUT is some obscure new encryption algorithm. This is about getting people to lock their laptop screens. Using donuts.

In the early days of the Amazon San Diego office, we were in an unsecured, shared office space with other companies. As such, it was crucial that people remembered to lock their screens whenever they left their computers, even if only for a few minutes. But, humans forget, and we needed a way to actively catch them and help correct their behavior.

Cyber theft bad guy ooooo!

A normal company probably would have put up posters and sent out emails about the importance of locking screens – which would have been promptly deleted and ignored. Or had the managers reminding employees about the importance of security blah blah blah. Or created a 45-minute training video about the dangers of cybertheft with spooky looking cartoon bad guys.

What we did was this.

If we stumbled upon an unlocked laptop screen, we would send out an email from that person’s account:


Subject: Free donuts tomorrow!

Hey everyone, I realized we haven’t had donuts in a while, so I’ll bring in a box tomorrow for everyone!  Enjoy!

You had to be fast, since the person could come back at any moment. The key was to not get caught, so they had no idea who did it.

If your computer was used to send such an email, you were duty-bound to bring donuts the next day. No ifs, ands, or buts. You had been donuted.  We consumed some ungodly creations, like these from VG:


This was remarkably effective.  Over time the donuts decreased in frequency, which was a little disappointing from a stomachular perspective, but showed it was effective from a security perspective.

There are a few reasons why this worked:

  1. We made it a game. Everyone could participate. It was fun.
  2. Humans hate being embarrassed. This was a mild sting, but it still stung enough for people to remember.
  3. There was an inconvenience factor. Now you had to drive and buy donuts tomorrow.
  4. There was social pressure. Everyone knew what was expected. Nobody ever failed to bring in donuts.

The cool thing is this game spread organically. I remember sending the first email out. It was a coworker’s computer – a senior Amazonian who should have known better. He dutifully brought donuts in the next day.  From there it caught on like wildfire. (I also donuted him another half dozen times before he learned – I swear he was the worst at locking his screen!)

Once the precedent was set, the game was on. Who said enforcing security was no fun?



How Amazon ended up in San Diego

One of my favorite career accomplishments so far was founding the Amazon San Diego office.  I wrote the 6-pager proposal about the city and presented it to Amazon VP/SVPs to gain approval.  I was employee #1 in the office, hiring a team that grew to 300+ people in less than 3 years. I chose all three locations – first the temp space in Solana Beach, then the interim space in UTC, then the final office at Campus Point where the office lives today.  I got to meet the mayor and be interviewed on TV which was a ton of fun.  You can read all about it in this Amazon Day One blog article.  There’s another more lighthearted article that talks about the space itself on Hatch.

download.jpegWhat those articles don’t highlight is the huge team effort it took to get us there.  Each manager at the office owned a different major initiative. One oversaw our hiring pipeline, while another ran mixers and events, while another keep the office stocked with snacks and beer. At the very start, the office didn’t have a printer, so another leader and I drove to Staples, bought one, brought it back in his truck, and plugged it in. We had 4 people squeezed into 2-person offices and it was a blast.

Hiring the right people, mentoring them to become leaders, and leaving behind an office that continues to thrive was incredibly rewarding.  It taught me a ton about what it means to be a good leader.  It’s not about telling people what to do or delegating tasks. It’s about inspiring people with a big hairy goal, empowering and supporting them, but then generally getting the hell out of their way.

This can be a very uncomfortable feeling.  If you do it right, you won’t have details on everything happening under your watch. People will be making decisions on their own and acting autonomously. You’ll find out things that have been happening for weeks that you were completely unaware of.  The trick is, when you find out, do those things make you go “cool!” or “holy sh*t why are they doing THAT??”

I’m not going to lie, there were definitely several “oh shit” moments – but most of the time it was pleasant surprises. Credit goes to the phenomenal management and senior engineering talent we were able to hire. For my part, I tried to be clear in what was most important. Rather than telling people what to do, I just tried to inspire them by why we were there in the first place.  We were colonizing a whole new city for Amazon – a pretty audacious goal to be part of.

Our mission as I saw it was:

  1. Prove to people that San Diego was a real tech city – real enough to support top-tier companies like Amazon.
  2. Create a culture where people were proud of their work, felt appreciated, and had a fun time overall.
  3. Bring the San Diego influence back into Amazon, and demonstrate that you could do awesome things while still having fun.

That’s basically it.  I like keeping things simple. 🙂

To everyone who was involved (and continues to be involved), all I can say is thank you.  It was an experience I’ll cherish the rest of my life and it’s hard for me to put into words how humbled I feel meeting so many incredible people.  Can’t wait to hear when the office hits 1,000 people!


Game Analytics with AWS at GDC 2014

I gave a talk at GDC 2014 all about game analytics and AWS. In the talk, I showed how to start small by uploading analytics files from users devices to S3, and then processing them with Redshift. As your game grows, add more data sources and AWS services such as Kinesis and Elastic MapReduce to perform more complex processing. Here are the slides on Slideshare and the videos on YouTube.

Free-to-play has become a ubiquitous strategy for publishing games, especially mobile and social games. Succeeding in free-to-play requires having razor-sharp analytics on your players, so you know what they love and what they hate. Free-to-play aside, having an awesome game has always been about maximizing the love and minimizing the hate. Charge a reasonable price for the things your players love and you have a simple but effective monetization strategy.

At the end of the talk, I blabbed a bit about what I see as the future of gaming: Big data and real-time analytics. The more in-tune you can get with your players, and the faster you can react, the more your game is going to differentiate itself. Recently there was a massive battle in EVE Online that destroyed nearly $500,000 worth of ships and equipment. Imagine being able to react in real-time, in the heat of battle, offering players discounted ammunition targeted at their fleet and status in battle. Some estimate impulse buys to account for 40% of all ecommerce meaning there is huge untapped potential for gaming in the analytics space.

Real-time Leaderboards with ElastiCache for Redis

With the launch of AWS ElastiCache for Redis this week, I realized my redis-objects gem could use a few more examples. Paste this code into your game’s Ruby backend for real-time leaderboards with Redis. Redis Sorted Sets are the ideal data type for leaderboards. This is a data structure that guarantees uniqueness of members, plus keeps members sorted in real time. Yep that’s pretty much exactly what we want. The Redis sorted set commands to populate a leaderboard would be:

ZADD leaderboard 556  "Andy"
ZADD leaderboard 819  "Barry"
ZADD leaderboard 105  "Carl"
ZADD leaderboard 1312 "Derek"

This would create a leaderboard set with members auto-sorted based on their score. To get a leaderboard sorted with highest score as highest ranked, do:

ZREVRANGE leaderboard 0 -1
1) "Derek"
2) "Barry"
3) "Andy"
4) "Carl"

This returns the set’s members sorted in reverse (descending) order. Refer to the Redis docs for ZREVRANGE for more details.

Wasn’t this a Ruby post?

Back to redis-objects. Let’s start with a direct Ruby translation of the above:

require 'redis-objects'
Redis.current = 'localhost')

lb ='leaderboard')
lb["Andy"]  = 556
lb["Barry"] = 819
lb["Carl"]  = 105
lb["Derek"] = 1312

puts lb.revrange(0, -1)  # ["Derek", "Barry", "Andy", "Carl"]

And… we’re done. Ship it.

Throw that on Rails

Ok, so our game probably has a bit more too it. Let’s assume there’s a User database table, with a score column, created like so:

class CreateUsers < ActiveRecord::Migration
  def up
    create_table :users do |t|
      t.string  :name
      t.integer :score

We can integrate a sorted set leaderboard with our User model in two lines:

class User < ActiveRecord::Base
  include Redis::Objects
  sorted_set :leaderboard, global: true

Since we’re going to have just a single leaderboard (rather than one per user), we use the global flag. This will create a User.leaderboard sorted set that we can then access anywhere:

puts User.leaderboard.members

(Important: This doesn’t have to be ActiveRecord — you could use Mongoid or DataMapper or Sequel or Dynamoid or any other DB model.)

We’ll add a hook to update our leaderboard when we get a new high score. Since we now have a database table, we’ll index our sorted set by our ID, since it’s guaranteed to be unique:

class User < ActiveRecord::Base
  include Redis::Objects
  sorted_set :leaderboard, global: true

  after_update :update_leaderboard
  def update_leaderboard
    self.class.leaderboard[id] = score

Save a few records:

User.create!(name: "Andy",  score: 556)
User.create!(name: "Barry", score: 819)
User.create!(name: "Carl",  score: 105)
User.create!(name: "Derek", score: 1312)

Fetch the leaderboard:

@user_ids = User.leaderboard.revrange(0, -1)
puts @user_ids  # [4, 2, 1, 3]

And now we have a Redis leaderboard sorted in real time, auto-updated any time we get a new high score.


The skeptical reader may wonder why not just sort in MySQL, or whatever the kewl new database flavor of the week is. Outside of offloading our main database, things get more interesting when we want to know our own rank:

class User < ActiveRecord::Base
  # ... other stuff remains ...

  def my_rank
    self.class.leaderboard.revrank(id) + 1


@user = User.find(1) # Andy
puts @user.my_rank   # 3

Getting a numeric rank for a row in MySQL would require adding a new “rank” column, and then running a job that re-ranks the entire table. Doing this in real time means clobbering MySQL with a global re-rank every time anyone’s score changes. This makes MySQL unhappy, especially with lots of users.

Kids are calling so that’s all for now. Enjoy!

Linux Network Tuning for 2013

Linux distributions still ship with the assumption that they will be multi-user systems, meaning resource limits are set for a normal human doing day-to-day desktop work. For a high-performance system trying to serve thousands of concurrent network clients, these limits are far too low. If you have an online game or web app that’s pushing the envelope, these settings can help increase awesomeness.

The parameters we’ll adjust are as follows:

  • Increase max open files to 100,000 from the default (typically 1024). In Linux, every open network socket requires a file descriptor. Increasing this limit will ensure that lingering TIME_WAIT sockets and other consumers of file descriptors don’t impact our ability to handle lots of concurrent requests.
  • Decrease the time that sockets stay in the TIME_WAIT state by lowering tcp_fin_timeout from its default of 60 seconds to 10. You can lower this even further, but too low, and you can run into socket close errors in networks with lots of jitter. We will also set tcp_tw_reuse to tell the kernel it can reuse sockets in the TIME_WAIT state.
  • Increase the port range for ephemeral (outgoing) ports, by lowering the minimum port to 10000 (normally 32768), and raising the maximum port to 65000 (normally 61000). Important: This means you can’t have server software that attempts to bind to a port above 9999! If you need to bind to a higher port, say 10075, just modify this port range appropriately.
  • Increase the read/write TCP buffers (tcp_rmem and tcp_wmem) to allow for larger window sizes. This enables more data to be transferred without ACKs, increasing throughput. We won’t tune the total TCP memory (tcp_mem), since this is automatically tuned based on available memory by Linux.
  • Decrease the VM swappiness parameter, which discourages the kernel from swapping memory to disk. By default, Linux attempts to swap out idle processes fairly aggressively, which is counterproductive for long-running server processes that desire low latency.
  • Increase the TCP congestion window, and disable reverting to TCP slow start after the connection is idle. By default, TCP starts with a single small segment, gradually increasing it by one each time. This results in unnecessary slowness that impacts the start of every request – which is especially bad for HTTP.

Ok, enough chat, more code.

Kernel Parameters

To start, edit /etc/sysctl.conf and add these lines:

# /etc/sysctl.conf
# Increase system file descriptor limit
fs.file-max = 100000

# Discourage Linux from swapping idle processes to disk (default = 60)
vm.swappiness = 10

# Increase ephermeral IP ports
net.ipv4.ip_local_port_range = 10000 65000

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Log packets with impossible addresses for security
net.ipv4.conf.all.log_martians = 1

Since some of these settings can be cached by networking services, it’s best to reboot to apply them properly (sysctl -p does not work reliably).

Open File Descriptors

In addition to the Linux fs.file-max kernel setting above, we need to edit a few more files to increase the file descriptor limits. The reason is the above just sets an absolute max, but we still need to tell the shell what our per-user session limits are.

So, first edit /etc/security/limits.conf to increase our session limits:

# /etc/security/limits.conf
# allow all users to open 100000 files
# alternatively, replace * with an explicit username
* soft nofile 100000
* hard nofile 100000

Next, /etc/ssh/sshd_config needs to make sure to use PAM:

# /etc/ssh/sshd_config
# ensure we consult pam
UsePAM yes

And finally, /etc/pam.d/sshd needs to load the modified limits.conf:

# /etc/pam.d/sshd
# ensure pam includes our limits
session required

You can confirm these settings have taken effect by opening a new ssh connection to the box and checking ulimit:

$ ulimit -n

Why Linux has evolved to require 4 different settings in 4 different files is beyond me, but that’s a topic for a different post. 🙂

TCP Congestion Window

Finally, let’s increase the TCP congestion window from 1 to 10 segments. This is done on the interface, which makes it a more manual process that our sysctlsettings. First, use ip route to find the default route, shown in bold below:

$ ip route
default via dev eth0 proto kernel dev eth0  proto kernel  scope link  src

Copy that line, and paste it back to the ip route change command, adding initcwnd 10 to the end to increase the congestion window:

$ sudo ip route change default via dev eth0 proto kernel initcwnd 10

To make this persistent across reboots, you’ll need to add a few lines of bash like the following to a startup script somewhere. Often the easiest candidate is just pasting these lines into /etc/rc.local:

defrt=`ip route | grep "^default" | head -1`
ip route change $defrt initcwnd 10

Once you’re done with all these changes, you’ll need to either bundle a new machine image, or integrate these changes into a system management package such as Chef or Puppet.

Additional Reading

The above settings were pulled together from a variety of other resources out there, and then validated through testing on EC2. You may need to tweak the exact limits depending on your application’s profile. Below are a few additional posts that make good reading:


Replacing Macbook HD with an SSD

My poor little laptop hard drive had been whining and whimpering, so I upgraded it to an SSD. Turned out to be inexpensive and very DIY friendly, so here are my cliffs notes.

Step 1: Choose an SSD

Mucho fasto SSDThe consensus is that Other World Computing (OWC) makes the most Mac-compatible SSD’s. I went with the OWC Mercury Extreme Pro SSD. 120GB cost me $149. If you have an older Macbook (pre-2011), or just want to save money, you can go with the slightly slower OWC Mercury Electra SSDinstead. I sprung for FedEx 2-day shipping for ~$10.

Step 2: Buy a USB Drive Case

This is so you can attach the new drive to your laptop temporarily, to copy over your data. Needs to be a 2.5” SATA for the SSD, with a USB connection for the laptop. Amazon has the Vantec NexStar 2.5-Inch SATA to USB 2.0 External Enclosure for $7.99. Done.

Step 3: Put Drive in Case

Open the NexStar drive case, and plug the OWC SSD into the connector. Close it up and attach it to your laptop via the USB cable. This step should seem very simple. If not, rethink continuing w/o help.

Step 4: (Optional) Grab a Beer

Drake’s Denogginzer goes well with upgrade-related tasks. Warning: With 22oz at 9.75%, the clock is now ticking.

Step 5: Partition the Drive

Disk Utility Window

Once you attach the drive, a window will popup saying something like “Unrecognized drive format”. Click the “Initialize” button to open up Disk Utility. You should see a screen like the one at right. Click the “Partition” button in the right pane, and do the following:

  1. Create a partition with all the available space, named whatever you want. I called mine “SSD Boot HD”.
  2. Click “+” to add a partition named “Recovery HD” of at least 750 MB in size. This is required for OSX Lion, Mountain Lion, or later, or if you’re using FileVault (disk encryption).

Both should be the default type of “Mac OSX Extended (Journaled)”. It’s important that the “Recovery HD” partition be second, because of restrictions on how Lion/Mountain Lion can and can’t resize boot partitions.

Step 6: Clone the Drive

Carbon Copy Cloner

Download Carbon Copy Cloner and install it. Theres’s a fully-functional 30-day trial so you can decide whether to purchase a license later. It’s a great program and worth supporting if possible.

When it first starts up, it’ll ask you if you want to see the “Quick Start Guide”. Say yes. It opens up instructions telling you exactly how to copy your existing hard drive to a new external drive.

All you do is select your existing drive on the left, probably “Macintosh HD”, and your new drive on the right (whatever you called it in Step 5), and click “Clone”.

You may get a popup saying something like, “Recovery HD partition does not contain the correct OS.” If so, follow the on-screen instructions to update it. I found CCC didn’t properly reset itself after this, so I had to exit, re-launch, and then click “Clone” again to start the clone.

Step 7: Wait

Sip on your beer from Step 4.

Step 8: Shutdown Mac, Swap Drives

Once the clone is finished, shutdown and unplug the power cable. Pull the external drive out of the case, reversing Step 3. Then, follow these excellent instructionsto physically install the SSD in your Macbook. Requires a teeny tiny midget screwdriver.

Step 9: Boot Mac, Enjoy

Everything should Just Work™, although I did notice that some programs like Dropbox required me to reenter my email/password the first time. For fun, try clicking on a beastly program like Photoshop or Word and it should open up unnervingly fast.

Atomic Rant Redux

My atomic rant has gotten a ton of traffic – more than I foresaw.  Seems atomicity is a hot topic in the web world these days. Increasing user concurrency, coupled with more interactive apps, exposes all sorts of edge cases. I wanted to write a follow-up post to step back and look at a few more high-level concerns with atomicity, as well as some Redis-specific issues we’ve seen.

Know Your Actors

new-moon-official-castIn my original rant, I used the example of students enrolling in online classes to illustrate why atomicity was crucial to operations with multiple actors. And speaking of actors, they’re an even better target analogy. You need to assume your actors are all going to try to jam through the audition door at the same time. What happens if they are all talking to the director at once? How many conversations can continue in parallel? If you’re careful, you can get away with one final gate at the end, which makes your life infinitely easier. That is, funnel everyone to a decision point, congratulate one person, then tell the others sorry.

Of course, if that funnel is too long, you’re going to piss off your users in a major way. If you’ve ever bought tickets from Ticketmaster, you’re familiar with this problem. Granted they’ve gotten much better over the years (which is saying something…), and this is partially due to embracing the Amazon guesses and apologies approach. If you have 200 tickets left, a person can probably get one. But if you have 10 tickets left, they’re probably going to get screwed. If you can help with the user’s expectations (“less than 10 tickets left!”) then people are more likely to be forgiving.

In the world of online games, this translates to showing players the number of slots left in a game, but then handing the situation where there were 2 slots left but you were the third person to hit “Submit”. You always need to handle these errors, because there’s no way to completely eliminate race conditions in a networked application.

Recovering from Hiccups

isharescapsizeSooner or later, your slick, smooth-running atomic system is going to have problems. Even if it’s well-engineered, you could have a large outage such as a system crash, datacenter failure, etc. Plan on it.

Using Redis to offload atomic ops from the DB yielded big performance benefits, but added fragility. You now have two systems that must stay in sync. If either one crashes, there’s the possibility that you’re going to have dangling locks for records that are ok, or vice-versa. So you need a way to clear them. In a perfect world with infinite time, you’d be able to engineer a self-detecting, self-repairing system that can auto-recover. Good luck with that. A cron job that deletes locks older than a certain time works pretty well for the rest of us.

It’s also a good idea to have a script you can run manually, in the event you know you need to reset certain things. For example, to handle the case where you know your Redis node went down, you could have a script that deletes all locks where the ID is > the current max ID in the DB. Oracle and other systems have similar concepts built into their native locking procedures

Troubleshooting Redis is a Pain

Unfortunately, Redis is lacking in the way of tools because it is still young. There is the PHP Redis Admin app, but its development appears to have stalled. Beyond that it’s pretty much roll-your-own-scripts at this point. We’ve thought about developing a general-purpose Redis app/tool ourselves, but with the Redis 2.0 changes and VMWare hiring Salvatore the tools side is a bit “wait and see”.

So before you start throwing all of your critical data into Redis, realize it’s a bit black-box at this point (or at least, a really dark gray). I’m not a GUI guy personally – I prefer command-line tools due to my sysadmin days – but for many programmers, GUI tools help debugging a lot. You need to make sure your programmers working with Redis can debug it when you have problems, which means a bigger investment in scripts vs. just downloading MySQL Workbench or Oracle SQL Developer

Check and Double-Check

The last thing worth mentioning is this: Don’t trust your own app. Even if you have an atomic gate at the start of a transaction, do sanity checking at the end too. There are a few reasons for this:

  • The lock may have expired for some reason, and you didn’t test for this
  • Your locking server may have crashed when you’re in the middle of a transaction
  • There could be a background job overlapping with a front-end transaction
  • Your software may have bugs (improbable, I know)

For example, we had a background job that was using the same lock as a front-end service. This ended up being a design mistake, but it was difficult to track down because it happened very infrequently. The only way we found it was we had assertions that would get hit periodically on supposedly impossible conditions. Once we correlated the times with the background job running, we were able to fix the issue rather quickly.

So my opinion is this: Try to do the right thing, but if it screws up, apologize to the user, recover, and move on.