Adjusting drive timeouts with mdb on Solaris or OpenIndiana
May 14th, 2011
Update (New): These timeouts don’t do squat because mpt_sas doesn’t honour the timeouts. This was recently uncovered by Nexenta and a patch to fix it is about to hit Illumos shortly. I’ll post when it does. Another patch is in progress which will further improve how mpt_sas handles failed drives. Thanks to Albert Lee for his work on them – you, sir, rock!
Update (Old): These timeouts don’t work nearly as well as one would hope. Unfortunately, the sd timeouts get passed to the driver, which in the case of mpt/mpt_sas appears to do very little with them. I have raised this as an issue within the Illumos community and the debate was quite polarising; the kernel developers deny there is a problem or disagree on how to solve it, despite lots of people complaining of the same symptoms. I think it’s a difficult problem to solve due to the wide variety of hardware types that ZFS/Illumos is deployed on.
Our way of coping with dodgy drives is to preempt their failure via trigger happy SMART/iostat monitoring scripts that zpool offline bad drives before they fail.
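As a rough illustration of that kind of monitoring, here is a minimal sketch, not our actual script. It assumes the input looks like the per-device summary lines of “iostat -En” (device name, then Soft/Hard/Transport error counts), that the device names match the ones the pool uses, and it only prints the zpool offline commands rather than running them. The function name offline_candidates and the pool name are made up for the example.

```shell
# Hypothetical sketch: print (not run) a "zpool offline" command for every
# device that reports at least one hard error.
# ASSUMPTION: input resembles the summary lines of "iostat -En":
#   <device> Soft Errors: N Hard Errors: N Transport Errors: N
# Usage: iostat -En | offline_candidates POOL
offline_candidates() {
    awk -v pool="$1" '
        # Match only the per-device summary lines, then test the hard error count.
        $2 == "Soft" && $5 == "Hard" && $7 + 0 > 0 {
            printf "zpool offline %s %s\n", pool, $1
        }
    '
}
```

Printing the commands first makes it possible to review them before anything is taken out of service; a real script would also want to look at SMART data and avoid offlining more disks than the pool’s redundancy allows.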
Yesterday we suffered our first disk failure in our shiny new NFS cluster that has been operating flawlessly for 3 months. The NFS cluster we have is quite nice – it consists of a pair of NFS servers (96GB of RAM, Dual Intel E5620 CPUs) dual-attached to a set of LSI SAS 6Gbps JBOD arrays, with lots of Seagate Constellation ES 2TB enterprise SAS drives. For good measure there’s 1.5TB of SSD cache (6x256GB SSDs) acting as a read cache (L2ARC), and a ZeusRAM SSD acting as the write cache (ZIL). It runs a custom build of OpenIndiana.
Ordinarily a disk failure would result in at most a few minutes of stall while the OS waits for the drive to recover, and gives up. However, this drive decided simply to run glacially slowly, so it didn’t get removed in a timely fashion. In fact, it didn’t get removed at all, resulting in all IO to the SAN being stuck, causing a rather severe outage. 45 minutes in total.
When things became unresponsive, we logged in, and “iostat -xn” showed a 100% busy time on one of the disks, while the others did nothing. We attempted to “zpool offline baddisk”. Nothing much happened, presumably because the OS thought the drive was fine and was waiting on some queued IO finishing, or something along those lines. We had no immediate way of yanking the disk out, so we decided to failover the cluster from the primary NFS node to the secondary. This consists of powering off the primary node and letting the cluster software import the ZFS zpool and bring NFS services online.
When the secondary NFS node started importing the zpool, iostat once again showed a 100% busy time on the bad disk. Crap. Andrzej had the bright idea of deleting the disk entries from /dev, and sure enough this prompted ZFS to think the drive had disappeared, and the pool finally imported.
So immediately the question springs to mind, why did the OS not take this bad disk out of service? We consulted with our upstream vendor (contacted the folks over at Illumos) and all became clear.
The answer lies in the defaults in the Solaris SCSI subsystem. The default timeout for IO is 60 seconds with 5 retries (or 3 retries if it’s fibre channel/eSAS). For a storage array like ours, this is a 3 minute timeout for a single IO (60 seconds × 3 retries) – or in other words, a very long time. Since the disk was accepting a trickle of IO, this timeout was never really reached.
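The arithmetic behind those figures can be sketched in a couple of lines of shell. This assumes, as the paragraph above does, that the worst case is simply the per-command timeout multiplied by the number of retries; the function name worst_case_secs is made up for illustration.

```shell
# Hypothetical helper: worst-case seconds spent on a single IO,
# ASSUMING worst case = per-command timeout x number of retries.
worst_case_secs() {
    timeout=$1
    retries=$2
    echo $((timeout * retries))
}

worst_case_secs 60 5   # default timeout, 5 retries: prints 300
worst_case_secs 60 3   # default timeout, 3 retries (fibre channel/eSAS): prints 180
worst_case_secs 5 3    # 5 second timeout, 3 retries: prints 15
```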
Thankfully the timeouts can be adjusted, and Garrett D’Amore, the founder of Illumos and one of the lead developers who works at Nexenta, strongly suggested tuning the timeout to 5 seconds, with 3 retries.
Setting the timeout value is quite easy – it’s the system wide tunable sd_io_time. Keep in mind this will affect all disks. Edit /etc/system and drop in:
set sd:sd_io_time=5
If you have desktop SATA drives you’ll probably want a higher timeout, especially if you don’t have TLER (Time limited error recovery) on them, which limits error recovery to around 7 seconds.
The number of retries is set in /kernel/drv/sd.conf via the sd-config-list property, which allows settings to be applied per disk type. To get 3 retries, the variable would be “retries-timeout:3”. The format of this file is a bit weird; here is an example for two disks:
sd-config-list = "STEC    ZeusRAM         ", "throttle-max:32, disksort:false, cache-nonvolatile:true",
"SEAGATE ST32000444SS    ", "retries-timeout:3";
The bit where you define the disk type is a fixed length field, consisting of 8 characters for the vendor, and 16 characters for the product. So you have to pad the field out to the correct length with spaces.
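Counting spaces by hand is error prone, so a small helper can build the padded string. This is only a sketch based on the 8 + 16 character layout described above; the function name sd_device_id is hypothetical, and it pads short names but does not truncate names that are too long.

```shell
# Hypothetical helper: build the 24-character disk type string for
# sd-config-list (vendor padded to 8 characters, product padded to 16).
sd_device_id() {
    printf '%-8s%-16s' "$1" "$2"
}

# Wrap the result in quotes so the trailing spaces are visible.
printf '"%s"\n' "$(sd_device_id SEAGATE ST32000444SS)"
printf '"%s"\n' "$(sd_device_id STEC ZeusRAM)"
```

The quoted output can be pasted straight into sd.conf as the first element of each pair.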
Once these are set, reboot to activate. You can check the values are set by doing:
## Print system wide sd_io_time timeout value:
# echo "sd_io_time::print" | mdb -k
0x3c
## Print per-disk timeout and retry values:
# echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | egrep "^un|un_retry_count|un_cmd_timeout"
un: ffffff093239d9c0
un_retry_count = 0x3
un_cmd_timeout = 0x5
un: ffffff093239d380
un_retry_count = 0x3
un_cmd_timeout = 0x5
...
The return values are in hexadecimal, so for example 0x3c is 60 seconds.
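If you would rather not convert in your head, the shell’s printf can do it. A minimal sketch (the function name hex_to_dec is just for illustration):

```shell
# Convert a hexadecimal value as printed by mdb (e.g. 0x3c) to decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x3c   # prints 60
hex_to_dec 0x5    # prints 5
hex_to_dec 0x3    # prints 3
```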
Adjusting values without rebooting
We have a number of storage servers in production, some of which we really didn’t want to reboot just to change the timeout value. After discussions with some of the Illumos kernel developers, we worked out how to set the property at runtime using the modular Solaris debugger, mdb. This allows editing kernel values at runtime.
The system wide sd_io_time is used to populate a per-disk timeout value which is also stored in the same structure as the per-disk retry count. So changing the values is pretty similar.
First, we want to obtain the memory addresses of the settings we wish to edit:
# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_cmd_timeout" | mdb -k > /tmp/un_cmd_timeouts
# cat /tmp/un_cmd_timeouts
ffffff0d347a3a7c un_cmd_timeout = 0x3c
ffffff0d247983bc un_cmd_timeout = 0x3c
ffffff0d3429d3fc un_cmd_timeout = 0x3c
ffffff0d55daf37c un_cmd_timeout = 0x3c
...
Now we have the values in /tmp/un_cmd_timeouts, we can set the value using mdb -kw:
# for i in `cat /tmp/un_cmd_timeouts | awk '{print $1}'` ; do echo ${i}/W 0x5 | mdb -kw ; done
We can then check the value was set by re-running:
# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_cmd_timeout" | mdb -k
Now we can do the same for un_retry_count:
# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_retry_count" | mdb -k > /tmp/un_retry_count
# for i in `cat /tmp/un_retry_count | awk '{print $1}'` ; do echo ${i}/W 0x3 | mdb -kw ; done
Hey presto, we just adjusted boot time kernel parameters on the fly :-)
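Since mdb -kw writes straight into the running kernel, it can be worth previewing the writes before running them. The following is a sketch of a dry run, not something from the procedure above: it reads a saved address file in the format shown earlier and prints the commands the loop would execute. The function name preview_writes is hypothetical.

```shell
# Hypothetical helper: read saved "::print -a" output (address in column 1)
# and print the mdb write commands without executing them.
# Usage: preview_writes FILE VALUE
preview_writes() {
    awk -v val="$2" '
        # Only accept lines whose first field looks like a hex kernel address,
        # so stray lines in the file are skipped.
        $1 ~ /^[0-9a-f]+$/ { printf "echo %s/W %s | mdb -kw\n", $1, val }
    ' "$1"
}
```

For example, “preview_writes /tmp/un_cmd_timeouts 0x5” prints one line per disk; once the list looks right, the same output can be piped to sh to apply it.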
If you need to know which disk is which, you can assume the output from mdb is ordered, and do:
echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_sd | ::print struct scsi_device sd_dev | ::devinfo -q" | mdb -k
This returns the sd instance id, which can be seen from “iostat -E”. StackOverflow has some answers for mapping from sd to device name should you need to.
Concluding Remarks
With these values in place, our timeout is reduced from upwards of 3 minutes, to a mere 15 seconds. This is far more likely to cause the OS to offline dodgy disks like the one we were experiencing issues with.
There has been some recent discussion on the Illumos mailing lists regarding the default sd_io_time value, suggesting that the default should be lowered to 8 seconds. This has caused a bit of a furore, as people using Solaris with fibre channel disk arrays require higher timeouts, say 180 seconds. So there are people on both sides of the fence. But one thing is for sure – it’s a setting more people should know about.

10 Comments
1. whatever | September 4th, 2011 at 7:11 pm
Hey, Alasdair:
Can you write an article detailing how to set up an “NFS cluster” with OpenIndiana?
As far as I know, when Oracle closed down OpenSolaris, they stopped contributing to OpenHA, which was only working with OpenSolaris 2009.06. The source code of OpenHA has now moved to Illumos gate, but activity in that project is low. You can’t even compile it nowadays. Considering Oracle Solaris Cluster 3.3u1 doesn’t run on Illumos based distros, or even Solaris 11 Express, I wonder which software package you used to set up HA ZFS and HA NFS for VM storage. (The only thing I can think of is to license RSF-1 just like NexentaStor HA)
Thanks
2. Richard Elling | October 14th, 2011 at 3:10 pm
For most HDD suppliers, 5 seconds is too low. If you consult the HDD specifications, the proper value is usually documented. For example, most Seagate nearline SAS models have a <7 second specification. Recommend setting to 8 seconds instead of 5 for two reasons: 1) fits the specs for many low-cost “hardware” RAID cards, 2) avoids false positives.
— richard
3. Alasdair | October 31st, 2011 at 11:33 pm
Hi Richard,
Thanks for the comment about the timeout, it’s a good suggestion.
However, in testing with failing hard drives (on mpt_sas anyway), we see that the sd timeouts are completely ignored, so my entire post above is moot!
4. Matt Connolly | November 29th, 2011 at 11:09 am
Interesting that this can affect small setups as well as large: I’ve seen this problem on my home NAS running OpenIndiana with two mirrored drives. In my case, one of them was a Western Digital green drive which slowed the whole machine to a near-halt by being busy.
I solved the problem by yanking that drive out and throwing it in the bin… Didn’t know about this then! :)
5. Aaron Knodel | December 12th, 2011 at 3:50 pm
Hi Alasdair,
Can you please comment further on the failure of this setting using mpt_sas? It sounds like what you’re saying is, this setting should in theory work, but the driver is causing issues and it never gets to the ZFS level to use this setting. Is that right? Did you test using the known bad drive from the beginning of the article? Any other details would be appreciated if you have them.
Thanks
6. Jon Strabala | February 10th, 2012 at 5:18 pm
I second this – can you write an article detailing how to set up an “NFS cluster” with OpenIndiana? I would rather start off with something than roll my own, understanding the “risks” are all mine.
Yes I read your comment in the oi_a51a release: “if you need clustering, then you should be able to justify the budget for it. Clustering on the cheap is a recipe for disaster”
My justification is the same reason both illumos and openindiana exist – need I say more?
Thanks in Advance
7. Alasdair | March 22nd, 2012 at 5:31 am
Hi Aaron,
Basically the setting didn’t help the situation I was experiencing. The sd timeouts don’t seem to be used by the mpt_sas driver. So adjusting them is completely pointless.
The way we reduced the possibility of bad disks impacting our storage is to preemptively zpool offline disks that show any signs of misbehaviour, such as taking drives out of service that exhibit a single hard error, or show any SMART errors. So far this has worked quite well.
But ultimately the storage subsystem and/or drivers in the OS need improving. There are bugs open about this:
https://www.illumos.org/issues/1553
https://www.illumos.org/issues/1069
8. Alasdair | March 22nd, 2012 at 5:36 am
Hi Jon,
It’s not too hard if you’re using NFS since NFS is stateless. You also have to use SAS drives and SAS arrays as you can plug a dual-head SAS array into two physical servers and both nodes can see the disks at the same time.
You then just need to write some scripts to:
a. Detect failure (we use NRPE to check writing to an nfs mount mounted via a loopback cable between the primary and the secondary node)
b. Have the secondary pull the plug on the primary (we do this via IPMI)
c. Forcibly import the zpool and bring up the IPs
To improve the c failover time we use VNICs with the same MAC on both machines, so hosts don’t have to learn new MAC addresses.
It’s not a huge amount of code.
With a big array (20TB+) it’s not fast enough to handle storing virtual machine disk images on it, as the failover can take > 180 seconds which is long enough for Windows and Linux to declare their disks dead and get very unhappy. But it’s fine for NFS clients which recover quite happily.
9. Chris | May 31st, 2012 at 3:34 am
Hi Alasdair, out of curiosity, what type of 256GB SSDs are you using on that server? We are working on a config almost exactly like yours and I happened to stumble on your post.
10. Alasdair | July 16th, 2012 at 4:33 pm
Hi Chris,
Sorry for taking a while to respond!
We are using Crucial SSDs, http://www.crucial.com. We’ve found them to be reliable, reasonably fast and cost effective, and they come with a 3 year warranty. A good all-rounder, even for enterprise workloads.