Linux Network Tuning for 2013

Linux distributions still ship with the assumption that they will be multi-user systems, meaning resource limits are set for a normal human doing day-to-day desktop work. For a high-performance system trying to serve thousands of concurrent network clients, these limits are far too low. If you have an online game or web app that’s pushing the envelope, these settings can help increase awesomeness.

The parameters we’ll adjust are as follows:

  • Increase max open files to 100,000 from the default (typically 1024). In Linux, every open network socket requires a file descriptor. Increasing this limit will ensure that lingering TIME_WAIT sockets and other consumers of file descriptors don’t impact our ability to handle lots of concurrent requests.
  • Decrease the time that sockets stay in the TIME_WAIT state by lowering tcp_fin_timeout from its default of 60 seconds to 10. You can lower this even further, but too low, and you can run into socket close errors in networks with lots of jitter. We will also set tcp_tw_reuse to tell the kernel it can reuse sockets in the TIME_WAIT state.
  • Increase the port range for ephemeral (outgoing) ports, by lowering the minimum port to 10000 (normally 32768), and raising the maximum port to 65000 (normally 61000). Important: This means you can’t have server software that attempts to bind to a port above 9999! If you need to bind to a higher port, say 10075, just modify this port range appropriately.
  • Increase the read/write TCP buffers (tcp_rmem and tcp_wmem) to allow for larger window sizes. This enables more data to be transferred without ACKs, increasing throughput. We won’t tune the total TCP memory (tcp_mem), since this is automatically tuned based on available memory by Linux.
  • Decrease the VM swappiness parameter, which discourages the kernel from swapping memory to disk. By default, Linux attempts to swap out idle processes fairly aggressively, which is counterproductive for long-running server processes that desire low latency.
  • Increase the TCP congestion window, and disable reverting to TCP slow start after the connection is idle. By default, TCP connections start with a small congestion window (historically 1–3 segments) and grow it gradually as ACKs arrive. This adds avoidable round trips to the start of every request – which is especially bad for HTTP.

Ok, enough chat, more code.

Kernel Parameters

To start, edit /etc/sysctl.conf and add these lines:

# /etc/sysctl.conf
# Increase system file descriptor limit
fs.file-max = 100000

# Discourage Linux from swapping idle processes to disk (default = 60)
vm.swappiness = 10

# Increase ephemeral IP ports
net.ipv4.ip_local_port_range = 10000 65000

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Log packets with impossible addresses for security
net.ipv4.conf.all.log_martians = 1

Since some of these settings are only read by services at startup, it’s best to reboot to apply them consistently (sysctl -p reloads the values immediately, but already-running processes may not pick all of them up).
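
Either way, you can spot-check a value afterward to confirm it took effect:

$ sudo sysctl -p                     # reload /etc/sysctl.conf immediately
$ sysctl net.ipv4.tcp_fin_timeout    # query a single setting
net.ipv4.tcp_fin_timeout = 10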

Open File Descriptors

In addition to the fs.file-max kernel setting above, we need to edit a few more files to increase the file descriptor limits. The reason is that the setting above is only a system-wide maximum; we still need to tell the shell what our per-user session limits are.

So, first edit /etc/security/limits.conf to increase our session limits:

# /etc/security/limits.conf
# allow all users to open 100000 files
# alternatively, replace * with an explicit username
* soft nofile 100000
* hard nofile 100000

Next, make sure /etc/ssh/sshd_config is configured to use PAM:

# /etc/ssh/sshd_config
# ensure we consult pam
UsePAM yes

And finally, /etc/pam.d/sshd needs to load the modified limits.conf:

# /etc/pam.d/sshd
# ensure pam includes our limits
session required pam_limits.so

You can confirm these settings have taken effect by opening a new ssh connection to the box and checking ulimit:

$ ulimit -n
100000
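
You can also inspect the limits of an already-running process via /proc (PID 1234 here is just an example):

$ grep 'open files' /proc/1234/limits
Max open files            100000               100000               files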

Why Linux has evolved to require 4 different settings in 4 different files is beyond me, but that’s a topic for a different post. 🙂

TCP Congestion Window

Finally, let’s raise the initial TCP congestion window to 10 segments (older kernels default to a much smaller value). This is set on the route rather than via sysctl, which makes it a more manual process than our other settings. First, use ip route to find the default route, shown in the first line below:

$ ip route
default via 10.248.77.193 dev eth0 proto kernel
10.248.77.192/26 dev eth0  proto kernel  scope link  src 10.248.77.212

Copy that line, and paste it back to the ip route change command, adding initcwnd 10 to the end to increase the congestion window:

$ sudo ip route change default via 10.248.77.193 dev eth0 proto kernel initcwnd 10
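
Run ip route again to confirm; the default route should now show the new window:

$ ip route | grep "^default"
default via 10.248.77.193 dev eth0  proto kernel  initcwnd 10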

To make this persistent across reboots, you’ll need to add a few lines of bash like the following to a startup script somewhere. Often the easiest candidate is just pasting these lines into /etc/rc.local:

defrt=`ip route | grep "^default" | head -1`
ip route change $defrt initcwnd 10

Once you’re done with all these changes, you’ll need to either bundle a new machine image, or integrate these changes into a system management package such as Chef or Puppet.

Additional Reading

The above settings were pulled together from a variety of other resources, and then validated through testing on EC2. You may need to tweak the exact limits depending on your application’s profile.


Replacing MacBook HD with an SSD

My poor little laptop hard drive had been whining and whimpering, so I upgraded it to an SSD. It turned out to be inexpensive and very DIY-friendly, so here are my Cliffs Notes.

Step 1: Choose an SSD

The consensus is that Other World Computing (OWC) makes the most Mac-compatible SSDs. I went with the OWC Mercury Extreme Pro SSD; 120GB cost me $149. If you have an older MacBook (pre-2011), or just want to save money, you can go with the slightly slower OWC Mercury Electra SSD instead. I sprung for FedEx 2-day shipping for ~$10.

Step 2: Buy a USB Drive Case

This is so you can attach the new drive to your laptop temporarily, to copy over your data. Needs to be a 2.5” SATA for the SSD, with a USB connection for the laptop. Amazon has the Vantec NexStar 2.5-Inch SATA to USB 2.0 External Enclosure for $7.99. Done.

Step 3: Put Drive in Case

Open the NexStar drive case, and plug the OWC SSD into the connector. Close it up and attach it to your laptop via the USB cable. This step should seem very simple. If not, rethink continuing w/o help.

Step 4: (Optional) Grab a Beer

Drake’s Denogginzer goes well with upgrade-related tasks. Warning: With 22oz at 9.75%, the clock is now ticking.

Step 5: Partition the Drive

[Screenshot: Disk Utility window]

Once you attach the drive, a window will pop up saying something like “Unrecognized drive format”. Click the “Initialize” button to open Disk Utility. You should see a screen like the one pictured above. Click the “Partition” button in the right pane, and do the following:

  1. Create a partition with all the available space, named whatever you want. I called mine “SSD Boot HD”.
  2. Click “+” to add a partition named “Recovery HD” of at least 750 MB in size. This is required for OS X Lion, Mountain Lion, or later, or if you’re using FileVault (disk encryption).

Both should be the default type of “Mac OS Extended (Journaled)”. It’s important that the “Recovery HD” partition be second, because of restrictions on how Lion/Mountain Lion can and can’t resize boot partitions.
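
You can double-check the layout from Terminal before cloning; the external disk should show “SSD Boot HD” followed by “Recovery HD”:

$ diskutil list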

Step 6: Clone the Drive

[Screenshot: Carbon Copy Cloner]

Download Carbon Copy Cloner and install it. There’s a fully-functional 30-day trial so you can decide whether to purchase a license later. It’s a great program and worth supporting if possible.

When it first starts up, it’ll ask you if you want to see the “Quick Start Guide”. Say yes. It opens up instructions telling you exactly how to copy your existing hard drive to a new external drive.

All you do is select your existing drive on the left, probably “Macintosh HD”, and your new drive on the right (whatever you called it in Step 5), and click “Clone”.

You may get a popup saying something like, “Recovery HD partition does not contain the correct OS.” If so, follow the on-screen instructions to update it. I found CCC didn’t properly reset itself after this, so I had to exit, re-launch, and then click “Clone” again to start the clone.

Step 7: Wait

Sip on your beer from Step 4.

Step 8: Shutdown Mac, Swap Drives

Once the clone is finished, shut down and unplug the power cable. Pull the external drive out of the case, reversing Step 3. Then, follow these excellent instructions to physically install the SSD in your MacBook. Requires a teeny tiny screwdriver.

Step 9: Boot Mac, Enjoy

Everything should Just Work™, although I did notice that some programs like Dropbox required me to reenter my email/password the first time. For fun, try clicking on a beastly program like Photoshop or Word and it should open up unnervingly fast.

Atomic Rant Redux

My atomic rant has gotten a ton of traffic – more than I foresaw.  Seems atomicity is a hot topic in the web world these days. Increasing user concurrency, coupled with more interactive apps, exposes all sorts of edge cases. I wanted to write a follow-up post to step back and look at a few more high-level concerns with atomicity, as well as some Redis-specific issues we’ve seen.

Know Your Actors

In my original rant, I used the example of students enrolling in online classes to illustrate why atomicity is crucial to operations with multiple actors. And speaking of actors, they make an even better analogy. You need to assume your actors are all going to try to jam through the audition door at the same time. What happens if they are all talking to the director at once? How many conversations can continue in parallel? If you’re careful, you can get away with one final gate at the end, which makes your life infinitely easier. That is, funnel everyone to a decision point, congratulate one person, then tell the others sorry.

Of course, if that funnel is too long, you’re going to piss off your users in a major way. If you’ve ever bought tickets from Ticketmaster, you’re familiar with this problem. Granted, they’ve gotten much better over the years (which is saying something…), and this is partially due to embracing the Amazon-style guesses-and-apologies approach. If you have 200 tickets left, a person can probably get one. But if you have 10 tickets left, they’re probably going to get screwed. If you set the user’s expectations (“less than 10 tickets left!”), people are more likely to be forgiving.

In the world of online games, this translates to showing players the number of slots left in a game, but then handling the situation where there were 2 slots left and you were the third person to hit “Submit”. You always need to handle these errors, because there’s no way to completely eliminate race conditions in a networked application.

Recovering from Hiccups

Sooner or later, your slick, smooth-running atomic system is going to have problems. Even if it’s well-engineered, you could have a large outage such as a system crash, datacenter failure, etc. Plan on it.

Using Redis to offload atomic ops from the DB yielded big performance benefits, but added fragility. You now have two systems that must stay in sync. If either one crashes, there’s the possibility that you’re going to have dangling locks for records that are ok, or vice-versa. So you need a way to clear them. In a perfect world with infinite time, you’d be able to engineer a self-detecting, self-repairing system that can auto-recover. Good luck with that. A cron job that deletes locks older than a certain time works pretty well for the rest of us.
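
As a sketch of that cron-style cleanup – assuming locks live in lock:* keys whose value is the epoch time they were taken (adjust to your own naming scheme):

# stale_locks.rb – run periodically from cron
require 'redis'

redis = Redis.new
max_age = 15 * 60  # locks older than 15 minutes are presumed dangling

redis.keys('lock:*').each do |key|
  taken_at = redis.get(key).to_i
  redis.del(key) if Time.now.to_i - taken_at > max_age
end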

It’s also a good idea to have a script you can run manually, in the event you know you need to reset certain things. For example, to handle the case where you know your Redis node went down, you could have a script that deletes all locks whose ID is greater than the current max ID in the DB. Oracle and other systems have similar concepts built into their native locking procedures.

Troubleshooting Redis is a Pain

Unfortunately, Redis is still young, and it shows in the tooling. There is the PHP Redis Admin app, but its development appears to have stalled. Beyond that, it’s pretty much roll-your-own-scripts at this point. We’ve thought about developing a general-purpose Redis app/tool ourselves, but with the Redis 2.0 changes and VMware hiring Salvatore, the tools side is a bit “wait and see”.

So before you start throwing all of your critical data into Redis, realize it’s a bit of a black box at this point (or at least, a really dark gray). I’m not a GUI guy personally – I prefer command-line tools due to my sysadmin days – but for many programmers, GUI tools help debugging a lot. You need to make sure your programmers working with Redis can debug it when you have problems, which means a bigger investment in scripts vs. just downloading MySQL Workbench or Oracle SQL Developer.
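
Until better tools arrive, that mostly means getting comfortable with redis-cli. For example, to eyeball the counters from the rant below (key names assume redis-objects’ default class:id:name scheme):

$ redis-cli keys 'course:*'
$ redis-cli get course:1:slots_taken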

Check and Double-Check

The last thing worth mentioning is this: don’t trust your own app. Even if you have an atomic gate at the start of a transaction, do sanity checking at the end too (see the sketch after this list). There are a few reasons for this:

  • The lock may have expired for some reason, and you didn’t test for this
  • Your locking server may have crashed when you’re in the middle of a transaction
  • There could be a background job overlapping with a front-end transaction
  • Your software may have bugs (improbable, I know)
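
A minimal sketch of that end-of-transaction sanity check, reusing the hypothetical Course example from the original rant:

# after the atomic gate has let us through, re-verify the invariant
@course.course_students.create!(:student_id => 101)
if @course.course_students.count > @course.max_students
  # “impossible” condition – log it loudly so it can be correlated later
  Rails.logger.error "Course #{@course.id} is over capacity!"
end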

For example, we had a background job that was using the same lock as a front-end service. This ended up being a design mistake, but it was difficult to track down because it happened very infrequently. The only way we found it was we had assertions that would get hit periodically on supposedly impossible conditions. Once we correlated the times with the background job running, we were able to fix the issue rather quickly.

So my opinion is this: Try to do the right thing, but if it screws up, apologize to the user, recover, and move on.

An Atomic Rant

You are probably not handling atomic operations properly in your app, and probably have some nasty lurking race conditions. The worst part is these will get worse as your user count increases, are difficult to reproduce, and usually happen in your most critical pieces of code. (And no, your unit tests can’t catch them either.)

Spoiler: If you’re part of the ADHD generation and want to skip learning and go straight to the punchline, use Redis and redis-objects for all your atomic data needs.

Brush Up Your Resume

Let’s assume you’re writing an app to enable students to enroll in courses. You need to ensure that no more than 30 students can sign up for a given course. In your enrollment code, you have something like this:

@course = Course.find(1)
if @course.num_students < 30
  @course.course_students.create!(:student_id => 101)
  @course.num_students += 1
  @course.save!
else
  # course is full
end

You’re screwed. You now have 32 people in your 30-person class, and you have no idea what happened.

“Well no duh,” you’re saying, “even the ActiveRecord docs mention locking, so I’ll just use that.”

@course = Course.find(1, :lock => true)
if @course.num_students < 30
  # ...

Nice try, but now you’ve introduced other issues. Any other piece of code in your entire app that needs to update anything about the course – maybe the course name, or start date, or location – is now serialized. If you need high concurrency, you’re screwed (still).

You think, “ah-ha, the problem is having a separate counter!”

@course = Course.find(1)
if @course.course_students.count < 30
  @course.course_students.create!(:student_id => 101)
else
  # course is full
end

Nope. Still screwed.

The Root Down

It’s worth understanding the root issue, and how to address it.

Race conditions arise from the difference in time between evaluating and altering a value. In our example, we fetched the record, then checked the value, then changed it. The more lines of code between those operations, and the higher your user count, the bigger the window of opportunity for other clients to get the data in an inconsistent state.

Sometimes race conditions don’t matter in practice, since often a user is only operating on their own data. This has a race condition, but is probably ok:

@user = User.find(params[:id])
@post = Post.create(:user_id => @user.id, :title => "Whattup")
@user.total_posts += 1  # update my post count

But this would be problematic:

@blog = Blog.find(params[:id])
@post = Post.create(:blog_id => @blog.id, :title => "Whattup")
@blog.total_posts += 1  # update post count across all users

As multiple users could be adding posts concurrently.

In a traditional RDBMS, you can increment counters atomically (but not return them) by firing off an update statement that self-references the column:

update users set total_posts = total_posts + 1 where id = 372

You may have seen ActiveRecord’s increment_counter class method, which wraps this functionality. This solves half the problem: the counter is updated atomically. But it has the significant side effect that your object is no longer in sync with the DB, so you get other issues:

@blog = Blog.find(params[:id])
Blog.increment_counter :total_posts, @blog.id
if @blog.total_posts == 1000
  # the 1000th poster - award them a gold star!
end

The DB says 1000, but your @blog object still says 999, and the right person doesn’t get their gold star. Sad faces all around.
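
Reloading narrows the window but doesn’t close it – the value can change again between the reload and your check:

Blog.increment_counter :total_posts, @blog.id
@blog.reload  # back in sync with the DB... until someone else increments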

A Better Way

Bottom line: Any operation that can alter a value must also return that value in the same operation for it to be atomic. If you do a separate get then set, or set then get, you’re open to a race condition. There are only a few systems that support an “increment and return” type operation, and Redis is one of them (Oracle sequences are another, and Postgres supports “update returning”).
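
For example, in Postgres the update statement from earlier can hand back the new value in the same atomic operation:

update users set total_posts = total_posts + 1 where id = 372 returning total_posts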

When you think of the specific things that you need to ensure, many of these will reduce to numeric operations:

  • Ensuring there are no more than 30 students in a course
  • Getting more than 2 but less than 6 people in a game
  • Keeping a chat room to a max of 50 people
  • Correctly recording the total number of blog posts
  • Only allowing one piece of code to reorder a large dataset at a time

All except the last one can be implemented with counters. The last one will need a carefully placed lock.
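
For that last case, redis-objects provides a lock type as well. A minimal sketch – the Dataset model and its reorder! method here are hypothetical:

class Dataset < ActiveRecord::Base
  include Redis::Objects

  # only one process may reorder at a time; auto-expire in case of a crash
  lock :reorder, :timeout => 60
end

@dataset = Dataset.find(1)
@dataset.reorder_lock.lock do
  @dataset.reorder!  # the expensive, must-be-serialized operation
end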

The best way I’ve found to balance atomicity and concurrency is, for each value, actually create two counters:

  • A counter you base logic on (eg, slots_taken)
  • A counter users see (eg, current_students)

The reason you want two counters is that you’ll need to change the value of the logic counter first, before checking it, to address any race conditions. This means the value can get wonky momentarily (eg, there could be 32 slots_taken for a 30-person course). This doesn’t affect its function – indeed, it’s part of what makes it work – but it does mean you don’t want to display it.

So, taking our Course example:

class Course < ActiveRecord::Base
  include Redis::Objects

  counter :slots_taken
  counter :current_students
end

Then:

@course = Course.find(1)
@course.slots_taken.increment do |val|
  if val <= @course.max_students
    @course.course_students.create!(:student_id => 101)
    @course.current_students.increment
  end
end

Race-condition free. Why? Because we’re checking the direct result of the increment operation against a value. The set and get operations are one and the same, which is the crucial piece. If that code block returns false, the counter is rewound, and no animals were harmed in this atomic op.

Then, due to the current_students counter, your views get consistent information about the course, since it will only be incremented on success. There is still a race condition where current_students could be less than the real number of CourseStudent records, but since you’ll be displaying these values in a view (after that block completes) you shouldn’t see this manifest in real-world usage.

Now you can sleep soundly, without fear of getting fired at 3am via an angry phone call from your boss. (At least, not about this…)