random complexity

with a side of crazy

Posts Tagged 'storage'


File server rebuild

So the long planned rebuild of my home storage box had been through a few different revisions over the months. Originally I was pretty set on the Backblaze case and design, aiming more for bulk storage than speed. I even went as far as procuring the nine port multipliers for it, however I never found a cheap enough way to get the bare chassis. When I compared what the complete Backblaze option would cost, it made little sense for anything other than sheer bulk storage.

Next up I considered an entry level NetApp. As I'm in the industry the pricing I could get was looking pretty good - however it was still going to be a tough sell: limited expandability, due to the high cost of disk shelves relative to the up front system cost. When I factored in the up front cost, including ALL disks, over 3 years it started to look not so terrible. But it all fell apart when I thought about how to back it up. My current system is backed up zfs send/recv style, which NetApp can do (SnapMirror), however I couldn't simply splurge for a second one. With a pretty firm price floor of about $6000 and no real option for expansion I just couldn't justify it - too many dollars for too little storage. Sure, it'd be fast, easy to manage, reliable and well built, but at twice the price of a DIY box and with no way to back it up. Backups are important, very important.

So after that I decided to go down a road quite a few others have too: an off the shelf 4RU 24-bay case and cheap LSI SAS HBAs. The only downside is the limit of 24 drives. To go beyond that I'd either need to build a complete second box, or buy the disk shelf version and additional HBAs. For now that's a future me problem. Fortunately the case was cheap compared to a Backblaze one, and included the hotswap backplanes. In a few ways this is the same expansion problem as the NetApp (buy a second box or high priced shelves), but at least backing it up is possible (the current backup box still works), and because I don't need to buy disks it costs only a third of the NetApp.

Regardless of the overall system design, I'd already selected a mainboard & CPU combination: an Intel CPU obviously, and a server class board to ensure enough sufficiently wide PCIe slots.

So the overall parts list is this (with rough prices):

  • Intel Xeon E3 CPU (I got the E3-1230v2 3.3GHz) ($270)
  • Intel S1200BTL server board ($240)
  • 16GB ECC unbuffered memory (2x8GB, so I can upgrade to 32GB later) ($240)
  • Norco 4224 chassis ($460)
  • Norco replacement fan wall for 120mm fans ($10)
  • 3x 120mm fans (quiet but high flow) ($12 each, $36)
  • 3x Intel SASUC8I HBAs (LSI 1068 based; each has 2x SFF8087 connectors, so supports 8 drives) ($163 each (inc freight), $489)
  • 6x SFF8087-SFF8087 SAS cables ($60)
  • Corsair HX1000 PSU (modular PSU) (Already had, the newer model is $250)
  • Terminal blocks, molex wiring looms, velcro etc ($30)
  • Subtotal : $2085 (without drives)

I started off by fitting the replacement fan wall and fans. Then I marked and drilled the sides of the rear area to support 3x 2.5" SSDs mounted internally, as I intended to boot from a USB drive and use rear mounted SSDs for cache.

Next up I worked out how I was going to wire power to the hot swap backplanes. Each of these horizontal blades supports 4 drives and has 2x standard 4-pin molex power connectors. As my power supply has two separate 12V rails I wanted to balance them as evenly as possible, so I wired the blades to alternating supplies - even numbered to one, odd numbered to the other. This meant using some terminal blocks and wiring off to the modular power connectors directly.

Here's the result.

Power cabling

Power rails

After this I fitted and tested the power supply and fans, then installed the mainboard, CPU and RAM and did some initial testing. The SAS HBAs went in next, followed by further testing.

Once the SAS HBAs were installed I was able to map out which drive bay mapped to which device. Fortunately this wasn't hard to figure out, and somehow I had 11 spare disks to assist with the mapping. On this mainboard the first 3 PCIe slots are direct to the CPU and the rest hang off the chipset, so I used the top 3 slots (which I number 1-3 working away from the CPU, see pic further down). What I ended up with was:

HBA to device mapping

I also took the time to reflash the cards from integrated RAID (IR) mode to initiator/target (IT) mode, as in that mode the controller isn't so quick to kick out a possibly failing disk (which allows the use of green power drives with fewer issues). After a bit of messing around trying to do this from UEFI I ended up booting the Win7 install disk and using the command prompt there.

HBA's showing IT mode

Now for the software side of things. I'd been working on my own OpenSolaris/OpenIndiana derived NAS distribution for a while, but was also interested in trying OmniOS and SmartOS on the hardware. What I found was that SmartOS wasn't the best fit for what I wanted. OmniOS was a good start, but I'd want to build my idea of a NAS system on top of it (already on my todo list, in fact). So for the time being I'd still run my own system.

Then I tried running this within a VM under ESXi. I'd read of a few people doing this with great success, however I was skeptical at first - partly due to concerns over how failure situations would be handled. Would a failing disk cause system instability or worse? After playing with it and swapping disks hot, I'm much more confident that passing through the whole PCIe card solves any concerns there. The important thing is that you need a mainboard and CPU that support VT-d.

Direct IO

The other concern with running a virtualised file server was performance and latency. I'm glad to report there is such a minimal difference on this hardware that it's not worth thinking about. I did some benchmarks with NetApp's Simulate IO (SIO) tool which show nearly identical performance before and after. These were done with a 10 disk raidz2 zpool made up of 500GB WD Blue drives spread across 3 HBAs.

The exact command line used was sio_ntap_win32.exe 0 0 64K 1900m 90 50 V:\testfile. The parameters are read percentage, random percentage, block size, file size, duration (seconds), thread count and filename. Using 0 for read% means write only, and 0 for random% means sequential. The point of this tool is to simulate IO, so we deliberately use a high number of threads (50) to generate heavy IO, and a large file size to reduce the benefit of memory caching on client and server. For comparison I've included a local SSD run of the same thing. The most relevant figures for comparison are IOPS and KB/s; each is the best of two runs.

Bare metal install

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        V:\testfile
Outputs
IOPS:           564
KB/s:           36080
IOs:            50738

VMware virtual machine install

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        V:\testfile
Outputs
IOPS:           521
KB/s:           33358
IOs:            46910

Single SATA2 SandForce based SSD (Win7 NTFS)

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        C:\Users\robert\Desktop\testfile
Outputs
IOPS:           2445
KB/s:           156497
IOs:            220074

Direct disk copies from my desktop PC's SSDs were limited only by the network: 112MB/s write speed sustained for 30GB (using 4GB test files) over CIFS. Bare metal and virtualised had the same speed - no difference at all.

Next I had to decide how the cache disks would work. With ESXi on bare metal booting off USB, I had an SSD as a datastore to contain the filer's boot disk. To provide the cache disks I had a few options:

  • put them on the SAS cards (losing hot swap bays)
  • attempt a whole disk passthrough in VMware (RDM?)
  • put a datastore on it and assign a large vmdk to the guest
  • or probably a few other options

It didn't look like I could pass through a whole disk off the onboard controllers, so that was out, which left me with either a datastore layer of overhead or losing hot swap bays.

Then I had to decide how to expand my zpool. Going into this upgrade I was using some very old hardware (5-7 years old) with drives that were about 2 years old (the oldest 5 dated 6/Feb/2010). There were 10x 2TB WD Green drives in the existing raidz2 zpool, one with known bad sectors. Earlier in the year I had purchased an additional two drives due to what looked like disk failures but turned out to be a failing mainboard. So I had 12 disks to work with, 1 a little bit dodgy (64kB bad out of 2TB).

Working from the ZFS optimal RAID size guidelines, I decided the next optimal size up from where I was would be one of the following:

  • 2x vdevs of 10 disks each in raidz2. 20 drives total, 4 parity disks (yes, I know it's striped parity).
  • 1x vdev of 19 disks in raidz3. 19 drives total, 3 parity disks.

So if I allow my dodgy disk to be used as a hot spare, the second option gives me a wider stripe of 19 disks, ultimately better protection against multiple disk failures, and still 20 disks in the chassis. Finally I decided to put the cache disks onto the hot swap trays as I had 4 bays free - why not put 3 SSDs in there then? So that's the plan. Right now there are 2 in there with a 3rd going in once it's been reclaimed from its current machine. The SSDs have been partitioned (GPT) with a 2GB slice at the front for the ZIL and the rest of the disk for L2ARC. ZIL mirrored, L2ARC not.
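For the record, attaching those slices looks something like this (a sketch only, using the pool name and device names that show up in the zpool iostat output further down; adjust for your own devices):

# 2GB front slices become a mirrored slog, the large second slices become (unmirrored) L2ARC
zpool add marlow log mirror c6t0d0s0 c7t0d0s0
zpool add marlow cache c6t0d0s1 c7t0d0s1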

Slot drive type

Initial Data Seeding.

To copy my data across I used zfs send/receive via a utility called mbuffer. Mbuffer helps smooth out any drops or bursts in IO on the sending side to maintain a higher average transfer speed over the network. In the past I have had some issues with this when sending a whole dataset; this time around I had no such issues and was able to copy the entire dataset in one continuous operation.
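For the curious, the transfer looked roughly like this - a sketch only, with hypothetical host and dataset names (mbuffer runs on both ends, talking over its own TCP port):

# on the new filer: listen on a port, buffer, and receive into the new pool
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F marlow/storage
# on the old box: send the snapshot through mbuffer to the new filer
zfs send oldpool/storage@migrate | mbuffer -s 128k -m 1G -O newfiler:9090

The summary line below is mbuffer's report from the end of the run.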

summary: 12.5 TByte in 39 h 49 min 91.4 MB/s

Now for some final benchmarking (this is with 100GB L2ARC, 2GB ZIL and the 19 disk raidz3 of 2TB WD Green drives)

CIFS: sio_ntap_win32.exe 0 0 64K 1900m 90 50 V:\testfile

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        V:\testfile
Outputs
IOPS:           1112
KB/s:           71191
IOs:            100112

NFS4: ./sio_ntap_linux 0 0 64K 1900m 90 50 /storage/siotest/testfile

SIO_NTAP:
Inputs
Read %:     0
Random %:   0
Block Size: 65536
File Size:  1992294400
Secs:       90
Threads:    50
File(s):    /storage/siotest/testfile 
Outputs
IOPS:       1342
KB/s:       85915
IOs:        120818

Big test (working file size > memory + l2arc size):

CIFS: sio_ntap_win32.exe 0 0 64K 140g 300 50 V:\testfile2

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      150323855360
Secs:           300
Threads:        50
File(s):        V:\testfile2
Outputs
IOPS:           727
KB/s:           46506
IOs:            217997

NFS4: ./sio_ntap_linux 0 0 64K 140g 300 50 /storage/siotest/testfile2

SIO_NTAP:
Inputs
Read %:     0
Random %:   0
Block Size: 65536
File Size:  150323855360
Secs:       300
Threads:    50
File(s):    /storage/siotest/testfile2 
Outputs
IOPS:       1503
KB/s:       96197
IOs:        450923

And now for a real world comparison I ran the same tests on an idle (not in production) NetApp FAS2240-2. However as the test machine was not the same I had to perform benchmarks of my system again from this client. It turned out the test machine is a pile of crap when it comes to network load testing.

CIFS to NetApp FAS2240-2 (19 disk RAID-DP aggregate of 600GB 10k SAS disks - tested from a dual core, 6GB RAM laptop via a crossover cable) sio_ntap_win32.exe 0 0 64K 1900m 90 50 Z:\testfile

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        Z:\testfile
Outputs
IOPS:           490
KB/s:           31349
IOs:            44082

CIFS to my system, same laptop. sio_ntap_win32.exe 0 0 64K 1900m 90 50 Z:\testfile

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        Z:\testfile
Outputs
IOPS:           414
KB/s:           26520
IOs:            37299

CIFS to the NetApp again, 49GB test size (didn't have time for mkfile to produce a larger file) (volume not deduped, no flash cache, no flash pool, controller has 6GB RAM, 768MB NVMEM). sio_ntap_win32.exe 0 0 64K 49g 300 50 Z:\testfile2

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      52613349376
Secs:           300
Threads:        50
File(s):        Z:\testfile2
Outputs
IOPS:           775
KB/s:           49609
IOs:            232529

CIFS to my system, same laptop, large file test (140GB - unfortunately I didn't test with a 49GB file for an equal comparison, however that could have fit in L2ARC so it wouldn't have been fair anyway) sio_ntap_win32.exe 0 0 64K 140g 300 50 Z:\testfile2

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      150323855360
Secs:           300
Threads:        50
File(s):        Z:\testfile2
Outputs
IOPS:           322
KB/s:           20616
IOs:            96639

CIFS to my system, from my original desktop PC (showing just how crap that laptop is - tests run on the same day as the laptop tests for a control sample) sio_ntap_win32.exe 0 0 64K 140g 300 50 Z:\testfile2

SIO_NTAP:
Inputs
Read %:         0
Random %:       0
Block Size:     65536
File Size:      150323855360
Secs:           300
Threads:        50
File(s):        Z:\testfile2
Outputs
IOPS:           854
KB/s:           54676
IOs:            256301

Interpreting the SIO results can be a bit of dark voodoo. Unfortunately I wasn't able to test the NetApp from a more realistic system - the laptop is clearly crap, achieving only 37% of the IOPS my desktop could achieve over the same network. Ignoring the crap laptop for now, this shows that the NetApp is clearly superior (as should be expected), however by a much smaller margin than I had expected. On the small test (which would fit in the RAM of both systems) the NetApp achieves 18% more IOPS (and throughput). For the large test the gap widens dramatically (however the test sizes were different). I'd be willing to bet the NetApp had much more headroom available for load than my system did - of course this wouldn't be visible with such a crap test machine. Due to this I think these tests are flawed and totally useless, apart from proving that my work laptop fails at networking.

One thing I did notice while running these tests, which I'd never seen before, was heavy use of the ZIL. Previously, when I had a mirrored ZIL on SSDs, I'd allocated 8GB for it, however I'd never seen it above about 200MB. I based 8GB on the old rule of "how much data could you ingest in 30 seconds, then double it". Allowing for a maxed out 1Gbit interface, 8GB seemed a good number, however I never saw it anywhere near used. So this time around I worked out a 2GB figure from a more conservative "how much data could you ingest in 8 seconds, doubled", working from 125MB/s. 8 seconds because the default flush interval is 5 seconds - in practice, when writing flat out, the disks are only hit with a burst every 5 seconds. Part of this ZIL sizing comes from my NetApp experience, where the NVRAM/NVMEM performs ultimately the same function (but is battery backed for power loss/crash consistency). Only the biggest NetApp system has 8GB NVRAM, and it can easily write from filled 10GbE interfaces out to over 1000 disks. Consider the FAS2240 I've been comparing against: it has 768MB (which is also halved in an HA pair, because it mirrors the partner's NVMEM too). This suggests I'm in the right ballpark, even though the comparison is not totally apples to apples.
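That sizing is nothing more than the following bit of arithmetic (my own rule of thumb, not anything official):

# 1GbE at ~125MB/s, an 8 second window, doubled for headroom
awk 'BEGIN { print 125 * 8 * 2 " MB" }'    # prints 2000 MB, i.e. roughly 2GB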

During the large file NFS SIO tests above I ran a quick zpool iostat -v 5 300 and spotted ZIL usage above 1.5GB! Fortunately it didn't stay there, hovering around 1GB for most of the test. Perhaps 2GB is close to correct, if not slightly too small, for this system? Following is the zpool iostat while running the NFS SIO tests:

                 capacity     operations    bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
marlow        16.0T  18.5T     96  1.68K  12.1M   146M
  raidz3      16.0T  18.5T     96    605  12.1M  70.2M
    c5t2d0        -      -     79     47   645K  4.64M
    c5t3d0        -      -     79     75   665K  4.63M
    c5t4d0        -      -     79     48   655K  4.65M
    c5t5d0        -      -     79     69   653K  4.64M
    c5t6d0        -      -     78     90   646K  4.63M
    c5t7d0        -      -     79     64   649K  4.64M
    c6t1d0        -      -     79     47   653K  4.65M
    c6t2d0        -      -     79     45   658K  4.65M
    c6t3d0        -      -     79     45   660K  4.65M
    c6t4d0        -      -     80     45   658K  4.65M
    c6t5d0        -      -     79     45   658K  4.65M
    c6t6d0        -      -     79     45   663K  4.65M
    c6t7d0        -      -     79     45   651K  4.65M
    c7t1d0        -      -     78     51   643K  4.64M
    c7t2d0        -      -     78     47   646K  4.65M
    c7t3d0        -      -     78     71   644K  4.64M
    c7t4d0        -      -     79     72   653K  4.63M
    c7t5d0        -      -     78     45   650K  4.65M
    c7t6d0        -      -     79     47   656K  4.65M
logs              -      -      -      -      -      -
  mirror      1.51G   485M      0  1.09K      0  76.1M
    c6t0d0s0      -      -      0  1.09K      0  76.1M
    c7t0d0s0      -      -      0  1.09K      0  76.1M
cache             -      -      -      -      -      -
  c6t0d0s1    53.9G      0     32    142  4.03M  17.7M
  c7t0d0s1    53.9G      0     33    135  4.14M  16.8M
------------  -----  -----  -----  -----  -----  -----

And now for the overall happy snaps. This is a nearly finished internal shot of the rear part of the case, showing the mainboard, two cooling fans (which now have 50 ohm resistors in series to slow them down) and the nearly finished cable routing. The PCIe cards are HBA1, HBA2, HBA3 from top to bottom. On the Intel SASUC8I cards, the connector closest to the mainboard is SAS ports 0-3, and the other connector is ports 4-7. Since this photo was taken I've also added 2 more 8GB memory modules (taking it to 32GB total), the Intel Remote Management Module (RMM) (which gives me remote console and remote CD-ROM/USB capability) and an Intel quad port Ethernet card.

Filer Overview

Front view showing the UPS too. Despite being brand new, there's a failed drive presence LED on bay 15 (4th down on the left) which I need to find out about replacing.

Filer Front

Yikes, this has turned into quite a big write up. In a future post I'll go into more detail on the software side, particularly my custom OpenIndiana thing.

What would happen if I took the red and blue pill at the same time

Some quick Amazon Glacier numbers

Amazon Glacier. A very nice idea, and a nice storage price (1 cent / GB month). Pricing of restores is a bit more complicated but seems to work out OK for large backups with small, infrequent restores. Using it as a full DR target is a bit of a stretch though, due to the huge cost of restoring everything quickly.

As great as it sounds, I needed to get my head out of the cloud. So, to bring it back to earth, how would it apply to me?

Hypothetically if you stored 10TB in it, that would be roughly $100/month in storage.

5% of that could be restored for free in any month, so 500GB. But that's prorated daily, so 16.6GB/day for 30 days. If you exceed this you pay based on the maximum hourly transfer volume you achieved (less the prorated hourly free allowance, 0.694GB in this example), multiplied by the number of hours in the month (720 for 30 days), multiplied by the excess fee of $0.01/GB. Gulp.
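To make that concrete, here's a worked example using the formula above and my own hypothetical scenario: pull back 1TB of the 10TB as fast as possible, say over 4 hours.

# peak hourly rate: 1024GB over 4 hours = 256GB/hour
# (peak hourly rate - free hourly allowance) x hours in the month x $0.01/GB
echo "(1024/4 - 0.694) * 720 * 0.01" | bc -l    # roughly $1838. Gulp indeed.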

Also, each restore job is only available for 24 hours, so that would be 30 restore jobs, not just one. I didn't see anything about a limit on how many jobs can be run in a month, or a per restore fee.

If you delete something within 3 months of uploading you pay a prorated fee per GB for deleting it. Best off leaving it there for 3 months because it'll cost you the same either way.

Now for some more relevant real world figures. Australian ADSL2 is marketed as up to 24Mbps downstream (and is 1Mbps upstream, unless you pay for Annex-M which gives about 2Mbps upstream).

Glacier uploads are nearly free, and multipart uploads can be consolidated into one archive. Nearly free because transfer is free, but requests cost $0.05 per 1000.

With 1Mbps upload, say you can sustain 110KB/s (about 90% of the theoretical max) for a whole 30 days; that's nearly 272GB uploaded, which would cost $2.72/month to store. So an initial seeding of an off site 10TB replica into Glacier would take my DSL connection 36 months of continuous uploading. This would run up a Glacier storage bill of $1811 (y = 0.5ax^2 + 0.5ax, where a is the storage charge added each month and x is the number of months).

On 2Mbps the upload duration would halve (18 months), so the running costs don't accrue quite as high; it comes to about $930.
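Both running totals are just the sum of an arithmetic series. A quick check of the formula (a = storage dollars added per month, x = months of seeding; the 2Mbps case assumes roughly double the monthly upload, so about $5.44/month added):

awk 'BEGIN { a = 2.72; x = 36; print 0.5*a*x*x + 0.5*a*x }'    # 1Mbps case, ~1811
awk 'BEGIN { a = 5.44; x = 18; print 0.5*a*x*x + 0.5*a*x }'    # 2Mbps case, ~930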

Now consider the NBN with 40Mbit upload. Assuming the same 90% utilisation it should be good for 4394KB/s, or 362GB per day - assuming Amazon can sustain that from an Aussie source IP. With the same 30 day month, that's 10860GB in the first month, which would cost $108/month to store. Now that is a realistic seeding duration.

However, internet quotas would still come into play. With 1TB plans available it would still take 10 months. Ten months of uploading, adding $10/month of storage each month, comes to $550 in storage fees for the initial seed.

So, bottom line: even if the NBN comes to my house, I wouldn't be able to back up to Glacier unless quotas increased dramatically (even temporarily).

Double facepalm

Disclaimer: my numbers might be off, probably because this whole post was knocked together in about 45 minutes. Record timing!

Blink and you'll miss it

Wow, another few weeks fly by. Let's see. I put my Wii onto eBay and it sold for just over $200. I've spent a few days learning Python and mucking around with it in mod_python and then mod_wsgi. I'll be playing with it and MongoDB for a small project that I'm building to scale up. I finished reading The Art of Being Minimalist, and started on Minimalist Business; both are highly recommended reads.

The DVD ripping project has progressed further. I've finished all the movies and made a good stab at the TV shows. This led to a much more in-depth understanding of de-interlacing filters than I would have preferred, as the perfectionist in me decided it needed to be right, so I had to find the best settings. Many of the discs I have are so poorly mastered that there is little you can do to fix the errors, however some come up brilliantly with just the right settings. I noticed that NTSC/USA sourced material (even region 4 PAL versions of USA content) usually comes up OK, but PAL sourced material (ABC/BBC) really needs a lot of work to make it not look like a total dog's breakfast. Some discs were even interlaced and then de-interlaced with the wrong field order prior to being re-interlaced to go onto DVD - the result is essentially unfixable. De-interlacing video correctly is like voodoo, and as little respect as I have for the so-called professionals who screw it up, it really does take skill, knowledge, practice and some luck to get it right. That aside, there is no excuse for screwing up the final product, especially if it's being mass produced and sold. When I pay full price for something I expect it to be right, not half assed. Free stuff can be half assed, because it's free, fine. It's breaking the old adage: you get what you pay for. It's things like that which make people think the Vodafone network has good coverage or speed.

That's similar to the reason I stopped buying DVD boxed sets. You get screwed for supporting a show early, and rewarded for jumping in late or after the boat's left. Don't ever buy a show season by season, because after the show's finished they'll release a box of the lot, which will be better, for less money. Or, while the show is airing, they'll change the cover/box artwork so your neat set doesn't match. I've been burned enough times to know that it's not a one off or my bad luck, it's standard procedure, so my response is to vote with my feet and not play that game.

While all this was going on, more house cleaning was done; I like it when tasks complement each other. The feeling of achieving a result is great. So I filled more bins full of crap and they have been dumped. I'll soon be putting up a list of DVDs which are essentially free to a good home for a token amount of cash.

Coffee Cat cup

We're nearly through the silly season now. So much craziness. As the year is about to wrap up, perhaps a review of last year's New Year's resolutions. I aimed low this year, and managed to achieve it. I simply wanted to "not eat Maccas all year", or really, to see how long I could last without going. Well, as it turns out I haven't eaten there all year (since mid December 2009), and only drank coffee from McCafe twice, in October, while in Esperance prior to the Tour De Freedom 1000. Even that was my fault, as I hadn't yet found the Coffee Cat, who despite being busy and slow (it was holiday time) manages to produce something that could pass for coffee in Perth. Even if it's not just a coffee cup.

More hardwarey storagey stuff

A while back I bought some OCZ Vertex 2 (SandForce based) SSDs, which dropped in price 2 weeks later (as expected). I put one into my desktop when I rebuilt it, which was great - now there's definitely no bottleneck on IO, and even though it's only got 4GB of RAM (shared with video too, so more like 3.5GB) it now has fast swap. Obviously you don't want to swap ever, but if you do start dipping into swap you'd prefer it to be fast and not impacted by other random IO on the disk, so the SSD is great.

The other 2 SSDs I bought to use as cache on my primary filer box. ZFS lets you have external caches for both reads and writes. The write cache is like the journal on a classic file system and is called the ZFS Intent Log (or ZIL). When the ZIL is external to the zpool it's commonly called a separate log, or slog, device. As the file system is built with failure in mind, you need to be aware of the various failure situations if you lose different devices. Losing the ZIL is bad, but no longer catastrophic (it used to be): you'll just lose any uncommitted changes, and it won't corrupt the pool. Obviously losing data is bad, so it's always been recommended that the ZIL be on mirrored storage. To size the ZIL there are some calculations related to how much data can be written in 10 seconds (it's flushed out at least every 10 seconds), also taking system RAM size into account. Aiming way too high (but catering for growth, maybe) I set mine at 8GB, mirrored.

The remaining space on the SSDs was to become the read cache, known as the L2ARC. The ARC (adaptive replacement cache) is a purely in-RAM cache, and the L2ARC is the second level version, intended for fast secondary storage devices like 15K drives or SSDs. Objects in this cache are still checksummed, so a device going bad can't corrupt anything (if the cache is bad, ZFS just reads from primary storage). Because of this there is no point mirroring the read cache, and by adding multiple devices you essentially stripe your cache, which is good. So the 2 SSDs were sliced up (Solaris partitions) with an 8GB slice for the mirrored ZIL and the rest for the L2ARC. Using 60GB SSDs I've now got over 90GB of high speed read cache; in theory this cache could read at over 500MB/s, in practice it's hard to tell. At least the SSDs are rated at 50000 IOPS for 4k random writes.

ssd partition layout

Half the idea behind all this was to improve power management. Say you're watching a film: the whole file can be pre-read into cache, and then while you're watching it can be served purely from the cache while the main disks spin down. ZFS apparently has some very neat caching algorithms in it, and I'm probably not seeing the best behaviour because the box doesn't have sufficient system RAM (only 2GB), but in its defence it is a 5 year old box. A rebuild (apart from disks) is long overdue.

So to actually do all this, once the disks are sliced up (using format) you can simply add the log and cache vdevs as follows (with your own device names):

zpool add <poolname> log mirror c2d0s1 c3d0s1
zpool add <poolname> cache c2d0s3 c3d0s3

Or if you were building the zpool from scratch (say with 8 disks) and all the above craziness (as you do):

zpool create <poolname> raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 \
     log mirror c2d0s1 c3d0s1 cache c2d0s3 c3d0s3

Which would leave you with a pool along these lines (once you've put heaps of data on it):

# zpool status
 pool: yawn
 state: ONLINE
 scrub: scrub completed after 14h31m with 0 errors on Wed Nov 17 01:41:46 2010
config:

        NAME        STATE     READ WRITE CKSUM
        yawn        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
        logs
          mirror-1  ONLINE       0     0     0
            c2d0s1  ONLINE       0     0     0
            c3d0s1  ONLINE       0     0     0
        cache
          c2d0s3    ONLINE       0     0     0
          c3d0s3    ONLINE       0     0     0

errors: No known data errors

# zpool iostat -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
yawn        12.7T  1.81T     26      5  3.20M   482K
  raidz2    12.7T  1.81T     26      5  3.20M   440K
    c0t0d0      -      -      6      1   550K  73.8K
    c0t1d0      -      -      6      1   550K  73.8K
    c0t2d0      -      -      6      1   550K  73.8K
    c0t3d0      -      -      6      1   550K  73.8K
    c0t4d0      -      -      6      1   550K  73.8K
    c0t5d0      -      -      6      1   550K  73.8K
    c0t6d0      -      -      6      1   550K  73.8K
    c0t7d0      -      -      6      1   550K  73.8K
  mirror    11.9M  7.99G      0      0      0  41.4K
    c2d0s1      -      -      0      0      0  41.4K
    c3d0s1      -      -      0      0      0  41.4K
cache           -      -      -      -      -      -
  c2d0s3    46.8G     8M      0      0  16.8K  98.6K
  c3d0s3    46.8G     8M      0      0  16.8K  98.3K
----------  -----  -----  -----  -----  -----  -----

Later, if you ever need to, you can remove these cache/slog devices - say you wanted to use the SSDs elsewhere or needed the SATA ports for more spinning disks. Cache devices can be removed at any zpool version, and the slog/ZIL can be removed as long as you're on zpool version 19 or above. You just have to be careful which command you use, just as when adding devices, because each has a different meaning.

zpool remove - removes a cache device or a top level vdev (for logs, including a log mirror). The mirror vdev name comes from the zpool status output.

zpool remove yawn c3d0s3
zpool remove yawn mirror-1

zpool detach - detaches a disk from a mirror vdev (detaching the second last disk returns it to a single device vdev). The example below detaches c3d0s1 from a mirror; if it's not a mirror the command will error out. Detach works on any mirror vdev, not just logs.

zpool detach yawn c3d0s1

zpool attach - makes a mirror out of an existing single device or mirror vdev. The example attaches the new device c3d0s1 as a mirror of the already present device c2d0s1. Attach also works on any single disk or mirror vdev (not raidz), not just logs.

zpool attach yawn c2d0s1 c3d0s1

After using ZFS for over 2 years now I've come to really appreciate it. There's a definite learning curve, but no more than the equivalent on Linux (Linux RAID + LVM + how to resize/reshape and so on). Actually ZFS might be simpler, as all the commands are clearly documented in one place and all behave the same. The best part of ZFS is knowing your data is not rotting away on disk, and the very easy incremental replication, and snapshots, and Solaris's smoking fast NFS server, and and and
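To illustrate that incremental replication, here's a minimal sketch, assuming a hypothetical remote host called backupbox that already holds a full copy of yawn/data up to the @monday snapshot:

# take a new snapshot, then send only the changes since the last one
zfs snapshot yawn/data@tuesday
zfs send -i yawn/data@monday yawn/data@tuesday | ssh backupbox zfs receive backuppool/data

Wrap that in a cron job and you have cheap off site replication.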
