This week, our whole SAP production environment was wiped out for a day. I wasn’t directly involved in the sleepless process that was the restore and subsequent tests, but it was very interesting to watch from my ivory tower.
First, it’s worth noting that Bluefin runs its whole SAP environment on VMware ESX. We have some 40-something SAP systems in total (mostly demo systems) and we run something like 80% of our business on SAP technologies as well. It looks like we should get most of the other 20% onto SAP this year – completing our entire Demand to Cash process on the latest SAP technologies including ERP, CRM, BW, Portal, BPC and BI4. We eat our own dog food.
We invested in VMware very early on and it has paid massive cost-reduction dividends. We also invested in storage infrastructure in 2007; it started to creak early this year, and a proposal to replace it is underway. In the interim – some 8 months ago – we invested in solid-state drive (SSD) storage for our productive environments.
On the face of it, this seemed like a really smart idea. Huge performance increases – allowing a BW upgrade in 2 hours for example – and a stop gap to keep our production systems fast whilst we decided what to do with our main storage fabric.
Why is SSD so fast?
Well, SSD in the Enterprise is much like it is in my MacBook Air. There are no moving parts, so the drive doesn’t have to seek to find your data. Instead, access to information is much faster – around 50x – compared to the spinning platters of metal you have in regular hard disks.
This means that for loads that require a lot of data to be accessed, SSD massively outperforms regular magnetic storage. It sped our productive environment right up and everyone in the company got to see a benefit of some sort.
It’s worth noting at this stage that we put the SSD disks in a RAID5 configuration – meaning that there was some redundancy. There are 6 disks in this case, and it is possible to lose one disk, with the others picking up the slack whilst the failed disk is replaced. It turns out that this was our downfall, as you will see later.
It’s also worth noting that the SSD storage that goes into Enterprise equipment isn’t the same stuff you get in the MacBook Air and other consumer devices. It’s faster and it has a much lower fault rate. SSD storage, by the way, has a finite write endurance: it can only absorb a certain amount of written data over its life. Past that point, your chances of failure go up exponentially.
What happened next?
So around a month ago, we lost all our production systems for a short while. Dell replaced the backplane (the part that connects the disks to the system) on the storage array and it started working again; they tested it in their labs and didn’t find a problem, but we had assumed at the time that it was a blip. In retrospect, it was probably a warning shot.
And on Monday night, the first SSD drive failed. This was promptly replaced and during the ensuing rebuild process, all of the other 5 drives failed. We lost all of our systems and had to restore to our other VMware farm – a time-consuming process that involved turning other systems off.
At first glance, this looks utterly bizarre, but it turns out that this is a feature of SSD storage – a dirty little secret, if you like. Remember these disks were only 8 months old.
Why do SSD drives fail en masse?
The answer turns out to be really simple, and it is very well explained technically in some dude called Ray’s blog.
But basically, SSD disks last for a certain number of write operations. And with RAID5, which we use, that number is effectively halved, because every write to data also triggers a write to parity. Use a database like Microsoft SQL Server and put logs and data on the same disk group (which we did, and which is quite normal in VMware environments), and you massively multiply the write volume again. It turns out that in the environment we have, SSD disks just won’t last.
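To make that concrete, here is a back-of-envelope lifetime estimate. Every figure in it is an assumption for illustration – drive size, rated write cycles, daily write volume and amplification factors are made up, not our actual specs – but the shape of the arithmetic is the point:

```python
# Back-of-envelope SSD lifetime estimate. All figures below are
# illustrative assumptions, not real drive specifications.

drive_capacity_tb = 0.4          # assumed 400 GB drive
rated_write_cycles = 10_000      # assumed program/erase cycles the flash tolerates
total_endurance_tb = drive_capacity_tb * rated_write_cycles   # total TB writable

raid5_write_penalty = 2          # each logical write also updates parity
db_log_amplification = 3         # logs + data on one disk group (assumed factor)

daily_logical_writes_tb = 1.0    # assumed database write volume per day
daily_physical_writes_tb = (daily_logical_writes_tb
                            * raid5_write_penalty
                            * db_log_amplification)

lifetime_days = total_endurance_tb / daily_physical_writes_tb
print(f"Estimated lifetime: {lifetime_days:.0f} days "
      f"(~{lifetime_days / 365:.1f} years)")
```

With these made-up numbers, a drive nominally good for thousands of terabytes of writes wears out in well under two years – which is roughly the story our 8-month-old disks told.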
So one of the SSD disks will fail sooner or later, which is fine (you replace it). But what happens next isn’t fine. During the rebuild process – which takes an hour or so – there is substantial additional load placed on the other SSD disks. Because they all have the same write endurance limit, and RAID5 balances the load very evenly across the 6 disks, there is a huge risk that they are all ready to fail at the same time.
Add the increased rebuild load on top of that accumulated wear and there is a very real chance that the failures will simply cascade and take out the rest of the disks in the group.
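The cascade argument can be sketched as a toy simulation. Everything here is assumed for illustration – the endurance figure, the ±2% manufacturing spread, and the extra writes a rebuild pushes to the surviving drives – but it shows why evenly balanced wear is the killer: the survivors are all within a whisker of the drive that just died.

```python
import random

# Toy model: 6 identical drives in RAID5. Balanced load means all
# drives accumulate wear at the same rate; only a small random spread
# in endurance decides which fails first. All numbers are assumptions.
random.seed(42)
N_DRIVES = 6
ENDURANCE_TB = 4000          # assumed total TB each drive can absorb
SPREAD = 0.02                # assumed +/-2% manufacturing variation
REBUILD_EXTRA_TB = 150       # assumed extra writes a rebuild adds per survivor

def simulate():
    endurance = [ENDURANCE_TB * (1 + random.uniform(-SPREAD, SPREAD))
                 for _ in range(N_DRIVES)]
    first_failure = min(endurance)   # balanced wear: every drive reaches
                                     # this write total at the same moment
    survivors = [e for e in endurance if e > first_failure]
    # A survivor cascades if its remaining headroom is smaller than the
    # extra writes the rebuild forces on it.
    return sum(1 for e in survivors
               if e - first_failure < REBUILD_EXTRA_TB)

runs = [simulate() for _ in range(1000)]
print(f"Average survivors lost during rebuild: "
      f"{sum(runs) / len(runs):.1f} of 5")
```

Under these assumptions, most runs lose nearly all five remaining drives during the rebuild – which is exactly the failure mode we hit.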
What can we do about this?
Well, short of not using SSD storage? There seem to be a few options:
1) Don’t use a balanced RAID level like RAID1 or RAID5 (or RAID50, RAID10 etc.). Instead use an unbalanced level like RAID4, which concentrates parity writes on a single disk. That disk will fail much sooner than the others, which is fine – you replace it long before the rest are close to wearing out.
2) Don’t put database logs on SSD. Put them on regular magnetic storage – logs massively decrease the time to failure.
3) Advances in SSD storage mean that they will be able to predict failure soon and tell you when to replace them before they go bang.
4) Mix age and/or vendors of SSD drives within a storage group. This will mean that not all the drives fail at once.
5) SSD technology is advancing fast and they are becoming more reliable with each release. This is partially down to smart electronics that balance writes and increase overall reliability.
It seems that despite massive advances in SSD storage, it’s not ready for the prime time yet. If you are considering investing in SSD then research whether the storage system you are buying takes the aspects above into account – and challenge the vendor on how they will work around the problems described.
It’s also fair to say that our IT organisation wasn’t to blame for this. It’s not a well documented phenomenon and we wanted to know how SSD storage worked in the real world of Enterprise IT. They worked long and hard to fix the problems and Chris turned up in the office last night at 7pm looking like death warmed up.
That said, it is clear that SSD is the future. Spinning platters of metal to store information is plain daft.