Disk life spans over a span of life

Oops, I found this in my upcoming posts folder when I thought I'd posted it already. Some of the dates might be off, as I wrote it a year ago and have only casually updated the timings now.

Over the years there have been quite a few articles on the life spans of consumer-grade SATA drives. The most notable I can recall are the ones by Google from a few years ago and the more recent series from Backblaze (they seem to be posting quarterly now). All of those articles cover huge fleets of disks over several years, maintained in computer room environments. They're excellent reads and produce some interesting results. More recently, a similarly themed paper about flash storage came out at the FAST '16 conference.

I've kept some rough notes on my disk fleet over the years and recently noticed the following in my fleet of 2TB Western Digital Greens. Lots of people hated the Green Power series, citing issues with spin-down or firmware. I've run mine (mostly) flawlessly on Solaris (OpenIndiana and now illumos) for years, and went from 1TB Greens to 2TBs and later 3TBs. To avoid beating around the bush, I'll focus only on the 2TB drives, since this is where I have the most relevant data.

My initial purchase was 8 drives in April 2010, which grew over the years to a system of 22 disks by the end.
Just under a year later (Feb 2011) I bought 2 more, most likely to replace a failure and keep a spare on hand. I never recorded exactly why.
Another year later (Jan 2012) I bought 2 more.
3 months later (April 2012), another 2. Around this time I expanded from an 8-disk system to 12, so it's reasonable to think I'd had either 2 failures by this point or a failure and a cold spare (again, no record).
5 months later (Sept 2012) I bought another 8 drives, which would have coincided with my expansion into a 24-bay case. Around here I switched to triple parity (raidz3) with 19 disks total, aiming for a ZFS stripe of a power-of-two number of data disks plus parity (16 + 3) for performance reasons.
5 months later again (Feb 2013), another 2 disks. Around here I switched from raidz3 to 2 raidgroups of raidz2, 20 disks total. The performance of the wider raidz3 stripe was too low, and adding one more disk allowed me to halve the raidgroup size.
9 months later (Nov 2013), 3 more disks. Around here I expanded to 2 raidgroups of 11, totalling 22 disks.
4 months later (March 2014), 4 more disks. These were the last 2TB drives I bought, due to the price difference compared with 3TB.
12 months later (March 2015), 2x 3TB disks: 1 to replace a 2TB failure and 1 as a cold spare.
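
The purchase history and the pool layouts above can be cross-checked with a few lines of Python. This is only a sketch: the `Layout` type and the constant names are hypothetical, and "power of two" is taken to mean the number of data disks per group (disks minus parity), which is how the rule is described above.

```python
from dataclasses import dataclass

# 2TB drives bought per purchase, in the order listed above.
# The March 2015 purchase is left out because those were 3TB drives.
PURCHASES_2TB = [8, 2, 2, 2, 8, 2, 3, 4]


@dataclass
class Layout:
    """A pool layout: number of raid groups, disks per group, parity disks per group."""
    name: str
    groups: int
    disks_per_group: int
    parity: int

    @property
    def data_disks(self) -> int:
        # Data disks per group = disks in the group minus its parity disks.
        return self.disks_per_group - self.parity

    @property
    def total_disks(self) -> int:
        return self.groups * self.disks_per_group


def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two when exactly one bit is set.
    return n > 0 and n & (n - 1) == 0


LAYOUTS = [
    Layout("raidz3, 1 x 19", groups=1, disks_per_group=19, parity=3),
    Layout("raidz2, 2 x 10", groups=2, disks_per_group=10, parity=2),
    Layout("raidz2, 2 x 11", groups=2, disks_per_group=11, parity=2),
]

if __name__ == "__main__":
    print(f"2TB drives bought in total: {sum(PURCHASES_2TB)}")
    for layout in LAYOUTS:
        ok = is_power_of_two(layout.data_disks)
        print(f"{layout.name}: {layout.total_disks} disks, "
              f"{layout.data_disks} data disks per group, power of two: {ok}")
```

Run as is, it confirms the 31 drives counted later in this post. It also shows that the final 2 x 11 layout has 9 data disks per group, so it no longer follows the power-of-two rule.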

As of March 2014 I had 4 disks showing 35000 power-on hours, so they would have been from that initial purchase. Not bad, having half of them still running nearly 4 years later.
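
Power-on hours convert to years by dividing by 8,760 (24 x 365). Here is a minimal sketch. It assumes the first drives went in on 1 April 2010 and the reading was taken on 31 March 2014, because the text only gives the months. It suggests that a 35,000-hour reading is essentially all of the time available since purchase, which is consistent with drives left spinning 24x7.

```python
from datetime import datetime

HOURS_PER_YEAR = 24 * 365  # ignores leap days


def power_on_years(hours: float) -> float:
    """Convert a SMART power-on-hours reading to approximate years of runtime."""
    return hours / HOURS_PER_YEAR


def wall_clock_hours(start: datetime, end: datetime) -> float:
    """Hours elapsed between two dates: the most a drive could have been powered on."""
    return (end - start).total_seconds() / 3600


if __name__ == "__main__":
    # Assumed dates: the post only says "April 2010" and "March 2014".
    bought = datetime(2010, 4, 1)
    checked = datetime(2014, 3, 31)
    print(f"35000 h is about {power_on_years(35000):.2f} years of runtime")
    print(f"At most {wall_clock_hours(bought, checked):.0f} h were available")
```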

The final 4x 2TB disks were purchased to replace the disks from the 35000-hour set. Those 4 disks had average request service times 6-12 times slower than the rest (a solid 12ms vs 1-2ms). That was a serious performance hit, probably due to bad sector relocation causing additional seeks. Yes, that's not a typo: 1-2ms is my average on a Green drive. Sounds odd, right? In theory the rotational speed dictates the worst-case rotational latency, but in practice on my system this is rarely seen, even under medium load. Under heavy load it all goes pear-shaped, because these are consumer-grade SATA drives spinning at under 7200rpm (WD Green spin speed varies by model).
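
For scale, rotational latency follows directly from spindle speed. One revolution takes 60,000 / rpm milliseconds, and on average the target sector is half a revolution away. The small sketch below assumes 5400 rpm purely for illustration, since Green spin speeds vary by model, and shows 7200 rpm alongside it for comparison. It also computes the ratio between the 12ms and 1-2ms service times.

```python
def revolution_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000 / rpm


def avg_rotational_latency_ms(rpm: float) -> float:
    """On average the target sector is half a revolution away."""
    return revolution_ms(rpm) / 2


def slowdown(slow_ms: float, fast_ms: float) -> float:
    """How many times longer the slow service time is than the fast one."""
    return slow_ms / fast_ms


if __name__ == "__main__":
    # 5400 rpm is an illustrative assumption; Green spin speeds vary by model.
    for rpm in (5400, 7200):
        print(f"{rpm} rpm: {revolution_ms(rpm):.2f} ms per revolution, "
              f"{avg_rotational_latency_ms(rpm):.2f} ms average rotational latency")
    # 12 ms on the worn drives vs 1-2 ms on the rest.
    print(f"slowdown: {slowdown(12, 2):.0f}x to {slowdown(12, 1):.0f}x")
```

At 5400 rpm the average rotational latency alone is about 5.6ms. An average service time of 1-2ms therefore likely means many requests complete without a full random-access wait on the platter.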

The other observation is that I switched from fewer, larger raidgroups to more, smaller ones. This was mainly for performance, as large raidgroups in ZFS seem to perform poorly, even with plenty of RAM and a sizable L2ARC. My 3TB box, which uses 3 smaller raidgroups and performs very well, backs up this theory.
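
One commonly cited rule of thumb fits this pattern. For small random I/O, a raidz group behaves roughly like a single disk, so a pool's random IOPS grow with the number of raid groups rather than the number of disks. The rough sketch below works under that assumption. The `Pool` type and the per-disk figure of 75 IOPS are hypothetical, not measured on these systems, and caching in ARC/L2ARC is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    raid_groups: int  # number of raidz vdevs in the pool


def estimated_random_iops(pool: Pool, per_disk_iops: float) -> float:
    # Rule-of-thumb assumption: each raidz vdev delivers roughly the small
    # random IOPS of one member disk, so the estimate scales with vdev count.
    return pool.raid_groups * per_disk_iops


if __name__ == "__main__":
    PER_DISK_IOPS = 75  # hypothetical figure for a slow-spinning SATA drive
    pools = [
        Pool("1 x raidz3 (19 disks)", 1),
        Pool("2 x raidz2 (22 disks)", 2),
        Pool("3TB box, 3 raid groups", 3),
    ]
    for pool in pools:
        print(f"{pool.name}: ~{estimated_random_iops(pool, PER_DISK_IOPS):.0f} IOPS")
```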

In all this time I've removed 3 of the 2TB disks for developing sector read/write errors (visible to and repaired by ZFS) before they actually failed. My 3TBs haven't fared as well, though I don't leave them spinning 24x7. They were all pre-owned drives with an average age of 12000 hours when I got them. They live in a backup box that I use for offline data protection, and it's also an excellent way to reshape my zpools by dumping and reloading (over a 10G crossover link).

Fast forward this story to November 2015: I've upgraded the 2TBs to 6TBs, as I'd decided the 8TB non-archive drives were too far away to wait for. I also switched to WD Red NAS drives, and they too work flawlessly with illumos/ZFS. It took a few days to resync from the archive. Over the following 18 months one of the 6TBs has developed bad sectors, and I'll be looking into swapping it out soon if I can figure out the warranty process.

The 3TB cold spare remains a spare to this day and is now 2 years old. I've had only 2 of the 3TBs fail on me so far. If my story and counting are correct, I've had only 3 of the 31x 2TB drives fail on me, with 3 replaced for developing bad sectors and 4 replaced for performance going to crap, over a total of 5.6 years. … let me rainman that … 9.68% cumulative failure rate, 1.73% annual failure rate. That's at the low end of failures for 2TB WDs that Backblaze saw in their stats (and theirs weren't Green Power drives).
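
The rates here are simple arithmetic: failures divided by drives, then spread evenly over the years in service. Backblaze's published stats instead annualize over drive-days, using failures / (drive-days / 365). That approach accounts for drives entering and leaving service at different times. The sketch below implements both. The function names are hypothetical, and because drive-days for this fleet weren't recorded, only the naive version is run on the counts from this post.

```python
from typing import Tuple


def naive_failure_rates(failures: int, drives: int, years: float) -> Tuple[float, float]:
    """Cumulative failure rate, and that rate spread evenly over the years (percent)."""
    cumulative = failures / drives * 100
    return cumulative, cumulative / years


def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Backblaze-style AFR: failures per drive-year of service, in percent."""
    return failures / (drive_days / 365) * 100


if __name__ == "__main__":
    cumulative, per_year = naive_failure_rates(failures=3, drives=31, years=5.6)
    print(f"cumulative: {cumulative:.2f}%, per year: {per_year:.2f}%")
```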

It will be interesting to see how the 6TBs last over a 5-year timeframe. So much is due to change in this space over the next 1-2 years in enterprise, and 2-4 years (at most) in the consumer space, with large-capacity SSDs coming to market in the enterprise space and larger SATA archive drives arriving. Hopefully we'll see cloud prices drop further too.
