Linux: When massive load on the disk makes the system freeze

This post was written by eli on October 27, 2010
Posted Under: Linux

The bad news is that Linux has been behaving pretty badly when the disk is under heavy load. The good news is that all you need is a kernel upgrade.

First, let’s run a simple experiment:

$ dd if=/dev/zero of=junkfile bs=1k count=20M

Don’t expect anything dramatic to happen right away. It takes half a minute or so. In the meantime, at another command prompt, let’s keep track of the memory info:

$ watch cat /proc/meminfo

What will soon happen is that MemFree goes low and Cached goes high. This is normal behavior: as the dd command floods the disk queue with blocks to be written, unused memory is allocated as temporary buffers for the data that is about to be written or has just been written. This is a sensible thing to do: the memory is unused anyhow, so why not remember what we just wrote, in case someone asks to read it?
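If the full /proc/meminfo listing is too noisy, a filter along these lines narrows it down to the interesting fields. This is only a sketch: the function name meminfo_brief and its optional file argument are made up for illustration. Dirty and Writeback aren't discussed above, but they are standard /proc/meminfo fields that show how much data is still waiting to reach the disk.

```shell
# Print only the /proc/meminfo fields relevant to this experiment.
# meminfo_brief is a hypothetical helper name. The optional argument
# (a file to read instead of /proc/meminfo) exists only so the filter
# can be tried on a saved copy.
meminfo_brief() {
    # Anchoring at ^ and ending at the colon keeps out look-alikes
    # such as SwapCached and WritebackTmp.
    grep -E '^(MemFree|Cached|Dirty|Writeback):' "${1:-/proc/meminfo}"
}
```

Since watch runs its command in a fresh shell, the function isn't visible there. To follow the numbers continuously, hand the grep command itself to watch:

$ watch "grep -E '^(MemFree|Cached|Dirty|Writeback):' /proc/meminfo"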

So far so good. The trouble begins when MemFree gets really low, say around 50000 kB, and the heavy disk load continues. For reasons beyond me, this makes the system stall, and then recover when the disk load stops. For example, our watch program stops updating. Have a look at the timestamp at the upper right, and see that it freezes for several seconds.
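Staring at the watch timestamp works, but the stalls can also be logged. The following is a rough sketch, not a precise measuring tool: is_stall and stall_watch are hypothetical names, and the 3-second default threshold is an arbitrary choice. The loop wakes up once a second and complains whenever a wakeup arrives much later than it should.

```shell
# is_stall PREV NOW THRESHOLD
# Succeeds (exit status 0) if NOW - PREV is larger than THRESHOLD.
# All three arguments are whole seconds.
is_stall() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# stall_watch [THRESHOLD]
# Sleeps one second at a time and prints a line whenever the gap
# between two wakeups exceeds THRESHOLD seconds (default 3, an
# arbitrary value). Runs until interrupted with Ctrl-C.
# Relies on "date +%s" (seconds since the epoch), which GNU date supports.
stall_watch() {
    threshold=${1:-3}
    prev=$(date +%s)
    while sleep 1; do
        now=$(date +%s)
        if is_stall "$prev" "$now" "$threshold"; then
            echo "stalled for about $(( now - prev )) seconds, ending $(date)"
        fi
        prev=$now
    done
}
```

Running stall_watch in a second terminal while the dd command is busy should leave a record of each freeze, assuming the shell itself gets stalled along with everything else.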

If the memory goes low on your system, but the system doesn’t get stuck, never mind. If it does, you may need the patch for vmscan.c, or just upgrade to the current stable kernel (2.6.36 when these lines were written).

A simple way to check whether your kernel has this fix is to look for the should_reclaim_stall() function in mm/vmscan.c in your kernel source. If the function is there, you’re fine. If not, and you’re running 2.6.32 or something like that, it’s very likely that you need the fix.
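That check is easy to script. A minimal sketch, assuming you have the source tree of the kernel you are actually running; has_reclaim_stall_fix is a made-up name, and the function only searches the source text, so it says nothing about a kernel built from a different tree.

```shell
# has_reclaim_stall_fix SRCDIR
# Succeeds (exit status 0) if mm/vmscan.c under the kernel source
# tree SRCDIR mentions should_reclaim_stall. The function name is
# hypothetical. A missing file makes grep report an error and fail.
has_reclaim_stall_fix() {
    grep -q 'should_reclaim_stall' "$1/mm/vmscan.c"
}
```

For example, with the kernel source unpacked under /usr/src/linux (an assumed location, yours may differ):

$ has_reclaim_stall_fix /usr/src/linux && echo "fix is present"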

Reader Comments

I have been getting the frequent freezes that plague the Linux community on my Debian Squeeze

See here for more details
http://newyork.ubuntuforums.org/showthread.php?p=10045183&posted=1#post10045183

Do you think this has any relevance?

#1 
Written By KernelSpace on October 29th, 2010 @ 22:52

I didn’t read all 1200+ posts of course, but it appears like they are talking about a system crash. I’m talking about a temporary condition, which doesn’t compromise the system’s stability. So no, I wouldn’t put my money on that.

#2 
Written By eli on October 29th, 2010 @ 23:03

Ah, thanks for the clarification. I believe I was experiencing less lag when using the Debian 2.6.36 experimental kernel, but it still freezes

#3 
Written By KernelSpace on October 30th, 2010 @ 00:09

I’m still having massive problems with this on 2.6.37, and now there’s that kworker thing going on as well.

Copying files – The system grinds to a halt. It’ll stall for hours.

Creating an endless list in Python – The system grinds to a halt again. No root privileges required.

The disk in question is encrypted with LUKS, and I remember there being a previous problem with LUKS-encrypted swap causing the same kind of stalling and the intense flashing of the disk LED. Perhaps using LUKS and/or LVM and/or other virtual block devices can cause some sort of race condition which stalls the system.

At any rate, life’s too short for this kind of stuff. Time to look at PC-BSD!

#4 
Written By David Oftedal on February 1st, 2011 @ 05:38

↑ By the way… You have to wonder what part of the disk is being used in that second case (Python filling up memory), when the disk isn’t being used for swap…

#5 
Written By David Oftedal on February 2nd, 2011 @ 01:44

Actually, that’s “regular” behaviour. I think it was always there and is because I/O throttling is a bit complicated due to buffering etc.

It (IMHO) has always been the case under Linux that when the kernel thinks there is an urgent need to write back dirty pages, it won’t let anything else happen until the last dirty page has been synced to disc.

And yes, that’s lame. It’s a reason why Linux is not suited for mission-critical stuff. There is no useful disk QoS. The developers all concentrate on the CPU scheduler, which seems to be a pretty simple task because you just don’t have to implement caching/buffering there, as it is done in hardware already.

Maybe your situation could be improved by using a battery-backed write cache controller? Another solution could be to limit the maximum memory usage of the processes that do a lot of I/O. Recent kernels support that via the cgroupfs. Interestingly, that also limits cache usage and thereby prevents cache draining by processes that read a lot of data that is never read twice.

#6 
Written By Mark on February 11th, 2011 @ 00:52

This is not regular behavior, but a bug. If Linux is considered unfit for mission-critical tasks, then it’s lost.

And it’s solved in recent kernels, so the solution is to upgrade. If you don’t mind the new bugs…

#7 
Written By eli on February 11th, 2011 @ 00:58

I agree with Eli! :D

And happily, the problem IS much better in newer kernels. In 2.6.37 it was pretty bad, but in 2.6.38, it’s down to a few seconds of stalling before it clears up. Perhaps in 2.6.39 it’ll be milliseconds…?

#8 
Written By David Oftedal on February 18th, 2011 @ 23:23

I had similar freeze issues last year but I fixed them by using the deadline IO scheduler. By default Linux uses cfq and this does not give good results for a desktop when you have intensive I/O.

In short, use

echo deadline > /sys/block/sda/queue/scheduler

For details, check
http://blog.vacs.fr/index.php?post/2010/08/28/Solving-Linux-system-lockup-when-intensive-disk-I/O-are-performed

#9 
Written By Stephane Carrez on March 8th, 2011 @ 00:18

Thank you for that suggestion. I’ve changed the IO scheduler on this system to deadline, and forwarded the suggestion to the bug report on Ubuntu Launchpad, to see if it’ll solve the problem for the other people who’ve reported it.

#10 
Written By David Oftedal on June 27th, 2011 @ 15:05
