random complexity

with a side of crazy


More hardwarey storagey stuff

A while back I bought some OCZ Vertex 2 (SandForce-based) SSDs, which dropped in price two weeks later (as expected). I put one into my desktop when I rebuilt it, and it's been great: there's definitely no IO bottleneck now, and even though the box only has 4GB of RAM (shared with video too, so more like 3.5GB) it now has fast swap. Obviously you never want to swap, but if you do start dipping into it you'd rather it be fast and not impacted by other random IO on the disk, so the SSD is great for that.

The other two SSDs I bought to use as a cache on my primary filer box. ZFS lets you have external caches for both reads and writes. The write cache is like the journal on a classic file system and is called the ZFS Intent Log (ZIL). When the ZIL is external to the zpool, it's commonly called a separate log, or slog, device. As the file system is built with failure in mind, you need to be aware of what happens when you lose different devices. Losing the ZIL is bad, but no longer catastrophic (it used to be): you'll just lose any changes not yet committed to the pool, and it won't corrupt the pool. Obviously losing data is still bad, so it's always been recommended that the ZIL live on mirrored storage. Sizing the ZIL comes down to how much data can be written in 10 seconds (it's flushed out at least every 10 seconds), with system RAM size factored in too. Aiming way too high (but catering for growth, maybe) I set mine at 8GB, mirrored.
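As a rough sketch of that sizing maths (assuming the fastest the box can ingest data is a saturated gigabit link, so ~125 MB/s; both numbers here are assumptions, not measurements):

```shell
# Hedged ZIL sizing sketch: the log only has to absorb what can arrive
# between transaction group flushes, which happen at least every 10 seconds.
WRITE_MBS=125    # assumed peak write rate in MB/s (~gigabit ethernet)
FLUSH_SECS=10    # maximum flush interval
echo "$((WRITE_MBS * FLUSH_SECS)) MB minimum"    # prints "1250 MB minimum"
```

So even ~1.25GB covers a flat-out gigabit link, which is why 8GB mirrored is "aiming way too high".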

The rest of the SSDs were to become the read cache, known as the L2ARC. The ARC (Adaptive Replacement Cache) is a purely in-RAM cache, and the L2ARC is its second level, meant for fast secondary storage devices like 15K drives or SSDs. Objects in this cache are still checksummed, so a device going bad can't corrupt anything (if the cached copy is bad, ZFS just reads from primary storage). That means there's no point mirroring the read cache, and by adding multiple devices you essentially stripe it, which is good. So the two SSDs were sliced up (Solaris partitions) with an 8GB slice each for the mirrored ZIL, and the rest for the L2ARC. Using 60GB SSDs I've now got over 90GB of high-speed read cache. In theory this cache could read at over 500MB/s; in practice it's hard to tell. At least the SSDs are rated at 50,000 IOPS for 4K random writes.
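The cache capacity arithmetic, as a sketch (assuming two 60GB drives, each giving up an 8GB slice to the ZIL mirror; L2ARC slices stripe, so their capacities simply add):

```shell
# Nominal L2ARC capacity: per-drive space left after the ZIL slice, striped.
DRIVES=2
DRIVE_GB=60
ZIL_GB=8
echo "$((DRIVES * (DRIVE_GB - ZIL_GB))) GB of L2ARC (nominal)"   # prints "104 GB of L2ARC (nominal)"
```

The nominal 104GB shrinks once you account for marketing-GB vs real-GiB and slice overhead, which is how the real slices end up around 46.8G each, hence "over 90GB".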

[image: SSD partition layout]

Half the idea behind all this was to improve power management. Say you're watching a film: the whole file can be pre-read into cache and then served purely from there while the main disks spin down. ZFS apparently has some very neat caching algorithms, and I'm probably not seeing the best behaviour because the box doesn't have enough system RAM (only 2GB), but in its defence it is a 5-year-old box. A rebuild (apart from the disks) is long overdue.

So to actually do all this, once the disks are sliced up (using format) you can simply add the log and cache vdevs as follows (with your own device names):

zpool add <poolname> log mirror c2d0s1 c3d0s1
zpool add <poolname> cache c2d0s3 c3d0s3

Or if you were building the zpool from scratch (say with 8 disks) and all the above craziness (as you do):

zpool create <poolname> raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 \
     log mirror c2d0s1 c3d0s1 cache c2d0s3 c3d0s3

Which would leave you with a pool along these lines (once you've put heaps of data on it):

# zpool status
 pool: yawn
 state: ONLINE
 scrub: scrub completed after 14h31m with 0 errors on Wed Nov 17 01:41:46 2010

        NAME        STATE     READ WRITE CKSUM
        yawn        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c2d0s1  ONLINE       0     0     0
            c3d0s1  ONLINE       0     0     0
          c2d0s3    ONLINE       0     0     0
          c3d0s3    ONLINE       0     0     0

errors: No known data errors

# zpool iostat -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
yawn        12.7T  1.81T     26      5  3.20M   482K
  raidz2    12.7T  1.81T     26      5  3.20M   440K
    c0t0d0      -      -      6      1   550K  73.8K
    c0t1d0      -      -      6      1   550K  73.8K
    c0t2d0      -      -      6      1   550K  73.8K
    c0t3d0      -      -      6      1   550K  73.8K
    c0t4d0      -      -      6      1   550K  73.8K
    c0t5d0      -      -      6      1   550K  73.8K
    c0t6d0      -      -      6      1   550K  73.8K
    c0t7d0      -      -      6      1   550K  73.8K
  mirror    11.9M  7.99G      0      0      0  41.4K
    c2d0s1      -      -      0      0      0  41.4K
    c3d0s1      -      -      0      0      0  41.4K
cache           -      -      -      -      -      -
  c2d0s3    46.8G     8M      0      0  16.8K  98.6K
  c3d0s3    46.8G     8M      0      0  16.8K  98.3K
----------  -----  -----  -----  -----  -----  -----

Later, if you ever need to, you can remove these cache/slog devices, say because you want to use the SSDs elsewhere or need the SATA ports for more spinning disks. Cache devices can be removed on any version, and the slog/ZIL can be removed as long as you're on zpool version 19 or above. You just have to be careful which command you use, as with adding devices, because they each mean something different.

zpool remove - removes a cache device or a top-level vdev (for logs, only a mirror of logs). The mirror vdev name comes from the zpool status output.

zpool remove yawn c3d0s3
zpool remove yawn mirror-1

zpool detach - detaches a disk from a mirror vdev (turning it back into a single-device vdev if you remove the second-last disk). The example below detaches c3d0s1 from its mirror; if it's not a mirror the command will error out. Detach works on any mirror vdev, not just logs.

zpool detach yawn c3d0s1

zpool attach - makes a mirror out of an existing vdev (single device, or mirror). The example attaches new device c3d0s1 as a mirror of the already-present c2d0s1. Attach also works on any vdev, not just logs.

zpool attach yawn c2d0s1 c3d0s1

After using ZFS for over 2 years now I've come to really appreciate it. There's a definite learning curve, but no more than the Linux equivalent: Linux RAID + LVM + how to resize/reshape and so on. Actually ZFS might be simpler, as all the commands are clearly documented in one place and all behave the same. The best part of ZFS is knowing your data is not rotting away on disk, and the very easy incremental replication, and snapshots, and Solaris's smoking-fast NFS server, and and and

ZFS experiment continued

So the ZFS experiment continues. Upon the release of b129 I set off into the unknown on a voyage of dedupe, which at first held the promise of lower disk usage, faster IO speeds, and the warm fuzzy feeling deep down that you only get from awesome ideas becoming reality. Ahem.

Most sources say you need more RAM, and that is true; what they don't say is how much RAM for what size data set, which would be far more useful to home users like me. My boxes have 2GB of RAM each, and that is not enough for dedupe, nowhere near. Not with 6TB of random-ish data. I might retry when I get to 8GB of RAM, but not before. You see, if ZFS can't keep the whole dedupe table in RAM ALL the time, any write to a dedupe-enabled volume results in reads (or at least seeks) to fetch the rest of the table. So what I saw was a gradual slowdown while writing to the volume. I was determined to let it finish, to see what savings I would make, and then scrap it due to performance, but after waiting 16 days for the copy I cancelled it.

The only way I found to even see the contents/size of the dedupe table (DDT) is zdb -DD, which produces output like this:

DDT-sha256-zap-duplicate: 416471 entries, size 402 on disk, 160 in core
DDT-sha256-zap-unique: 47986855 entries, size 388 on disk, 170 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    45.8M   5.69T   5.66T   5.66T    45.8M   5.69T   5.66T   5.66T
     2     394K   43.0G   40.3G   40.3G     821K   89.0G   83.0G   83.1G
     4    9.90K    527M    397M    402M    47.0K   2.35G   1.76G   1.79G
     8    2.06K    125M   82.4M   83.4M    21.1K   1.20G    795M    806M
    16      391   13.7M   8.54M   8.76M    7.26K    272M    162M    166M
    32       69   1.17M    776K    822K    3.08K   51.3M   32.7M   34.8M
    64       17    522K    355K    368K    1.43K   36.9M   25.1M   26.2M
   128        6    130K      7K   11.2K    1.07K   31.3M   1.50M   2.23M
   256        2      1K      1K   2.48K      833    416K    416K   1.01M
   512        4      2K      2K   4.47K    2.88K   1.44M   1.44M   3.32M
    2K        1     512     512   1.24K    2.79K   1.39M   1.39M   3.46M
 Total    46.2M   5.73T   5.70T   5.70T    46.7M   5.78T   5.74T   5.74T

dedup = 1.01, compress = 1.01, copies = 1.00, dedup * compress / copies = 1.01
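Those zdb numbers also let you estimate how much RAM the DDT wants, assuming the whole table should stay resident (the "size ... in core" figures above are bytes per entry):

```shell
# Entry counts and in-core sizes taken from the zdb -DD output above:
#   duplicate: 416471 entries at 160 bytes, unique: 47986855 at 170 bytes.
awk 'BEGIN {
    total = 416471 * 160 + 47986855 * 170     # total in-core bytes
    printf "%.1f GiB\n", total / 1024^3       # prints "7.7 GiB"
}'
```

About 7.7 GiB just for the dedupe table, which makes 2GB of system RAM look as hopeless as it felt, and even 8GB only barely adequate.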

Savings of around 80GB from dedupe and compression (it's a backup box, so no real-world performance requirement) are just not worth needing 3-n times the RAM and possibly an SSD for the L2ARC to speed things up. Yep, the suggestion (and observed behaviour) was to hook up a cheap small (30GB) SSD as cache to accelerate it. I wouldn't mind that so much on a primary box, but this is my backup/second-copy box, so it's not really ideal. Certainly not for 80GB of savings, which at current prices is around $5 of disk.

My second attempt is now underway. This time I've sliced my data sets into more volumes, which means a smaller average size, around 2TB max per volume, which experience at work has taught me is a good rule of thumb. So now I can enable compression+dedupe on only specific bits, hopefully where the most savings are to be made, and store the rest raw. This way the savings might be similar, but without the major write-speed penalty. I've also realised that if I want screaming performance on the production box I'll throw an SSD in there, but that means more SATA ports, which means a major change. I need to work on power management too.
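The per-volume switching is just ZFS dataset properties, along these lines (the dataset names here are made up for illustration; properties only affect data written after they're set):

```shell
# Dedupe + compression only where duplicates are likely (hypothetical dataset):
zfs create yawn/backups
zfs set dedup=on yawn/backups
zfs set compression=on yawn/backups

# Bulk media gets stored raw: no dedupe win there, just the write-speed pain.
zfs create yawn/media
zfs set dedup=off yawn/media
zfs set compression=off yawn/media
```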

One thing that has gone right this time: I'm now booting off CF->IDE adaptors. This way the OS thinks it's on a 2GB HDD, so booting avoids the complexity of USB boot, and it also uses less power and doesn't take up a SATA port. Of course new boards don't have PATA anymore, so I might need to get a CF->SATA one in future.

Another thing that must be said: Solaris's CIFS server is fast.

Copyright © 2001-2017 Robert Harrison. Powered by hampsters on a wheel. RSS.