Posts Tagged 'hardware'

Nov 18

A while back I bought some OCZ Vertex 2 (SandForce based) ssd's, which dropped in price 2 weeks later (as expected). I put one into my desktop when I rebuilt it, which was great: now there's definitely no bottleneck on IO, and even though it's only got 4gb of ram (shared video too - so more like 3.5gb) it now has fast swap. Obviously you don't want to swap ever, but if you do start dipping into swap you'd prefer it to be fast and not impacted by other random IO on the disk, so the ssd is great.

The other 2 ssd's I bought to use as a cache on my primary filer box. ZFS lets you have external caches for both read and write. The write cache is like the journal on a classic file system and is called the ZFS Intent Log (or ZIL). When the ZIL lives on its own device rather than on the pool's main disks, it's commonly called a separate log, or slog device. As the file system is built with failure in mind, you need to be aware of what happens when you lose different devices. Losing the ZIL is bad, but no longer catastrophic (it used to be). Now you'll just lose all uncommitted changes to the disk, which is fine, and it won't corrupt the pool. Obviously losing data is bad, and it's always been recommended that the ZIL be on mirrored storage. To select the size of the ZIL there are some calculations based on how much data can be written in 10 seconds (it's flushed out at least every 10 seconds), also taking into account system RAM size. Aiming way too high (but catering for growth maybe) I set mine at 8gb, mirrored.
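As a rough illustration of that sizing rule, the sketch below estimates a slog size as the data that could arrive in one flush interval, times some headroom. The 10 second interval is from the paragraph above; the throughput figure, the headroom factor and the function name are assumptions made up for the example, and it ignores the RAM-based part of the calculation entirely.

```python
def slog_size_bytes(write_mb_per_s, flush_interval_s=10, headroom=2.0):
    """Estimate slog size: data written in one flush interval, times headroom.

    write_mb_per_s is an assumed peak write rate in decimal megabytes/second.
    """
    return int(write_mb_per_s * 1_000_000 * flush_interval_s * headroom)


# Assumed example: writes arriving at roughly gigabit line rate (~125 MB/s).
size = slog_size_bytes(125)
print(f"{size / 1e9:.1f} GB")  # prints "2.5 GB"
```

Under those assumptions a couple of gigabytes would do, which is why 8gb counts as aiming way too high.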

The rest of the ssd's were to become the read cache, known as the L2ARC. The ARC (adaptive replacement cache) is a purely in-ram cache, and L2ARC is the second level version to be used on fast secondary storage devices like 15K drives or ssd's. Objects in this cache are still checksummed, so a device going bad can't corrupt anything (if the cache is bad, just read off the primary storage). Due to this there is no point mirroring the read cache, and by adding multiple devices you essentially stripe your cache, which is good. So the 2 ssd's were sliced up (Solaris partitions) with an 8gb slice for the mirrored ZIL, and the rest for the L2ARC. Using 60gb ssd's I've now got over 90gb of high speed read cache. The theory is this cache could read at over 500MB/s; in practice it's hard to tell. At least the ssd's are rated at 50000 IOPS for 4k random writes.
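The "over 90gb" figure can be sanity checked with a little arithmetic. This is only a sketch: it assumes the 60gb on the label means decimal gigabytes and the 8gb log slice is binary, and it ignores partition table overhead (the zpool iostat output further down shows 46.8G per cache slice, slightly less than this estimate).

```python
GB = 1000**3   # decimal gigabyte, as drive vendors count (assumption)
GIB = 1024**3  # binary gigabyte


def cache_slice_gib(ssd_gb, log_slice_gib):
    """Space left for L2ARC on one ssd after carving out the log slice."""
    return ssd_gb * GB / GIB - log_slice_gib


per_ssd = cache_slice_gib(60, 8)
print(f"{per_ssd:.1f} GiB per ssd, {2 * per_ssd:.1f} GiB striped")
```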

[Image: ssd partition layout]

Half the idea behind all this was to improve power management. Say you're watching a film: the whole file can be pre-read into cache and then, while watching, it can be served purely from the cache while the main disks have spun down. ZFS apparently has some very neat caching algorithms in it, and I'm probably not seeing the best behaviour because the box doesn't have sufficient system ram (only 2gb), but in its defence it is a 5 year old box. A rebuild (apart from disks) is long overdue.

So to actually do all this, once the disk is sliced up (using format) you can simply add the log and cache vdev's as follows (with your device names):

zpool add <poolname> log mirror c2d0s1 c3d0s1
zpool add <poolname> cache c2d0s3 c3d0s3

Or if you were building the zpool from scratch (say with 8 disks) and all the above craziness (as you do):

zpool create <poolname> raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 \
     log mirror c2d0s1 c3d0s1 cache c2d0s3 c3d0s3

Which would leave you with a pool along these lines (once you've put heaps of data on it):

# zpool status
 pool: yawn
 state: ONLINE
 scrub: scrub completed after 14h31m with 0 errors on Wed Nov 17 01:41:46 2010
config:

        NAME        STATE     READ WRITE CKSUM
        yawn        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
        logs
          mirror-1  ONLINE       0     0     0
            c2d0s1  ONLINE       0     0     0
            c3d0s1  ONLINE       0     0     0
        cache
          c2d0s3    ONLINE       0     0     0
          c3d0s3    ONLINE       0     0     0

errors: No known data errors

# zpool iostat -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
yawn        12.7T  1.81T     26      5  3.20M   482K
  raidz2    12.7T  1.81T     26      5  3.20M   440K
    c0t0d0      -      -      6      1   550K  73.8K
    c0t1d0      -      -      6      1   550K  73.8K
    c0t2d0      -      -      6      1   550K  73.8K
    c0t3d0      -      -      6      1   550K  73.8K
    c0t4d0      -      -      6      1   550K  73.8K
    c0t5d0      -      -      6      1   550K  73.8K
    c0t6d0      -      -      6      1   550K  73.8K
    c0t7d0      -      -      6      1   550K  73.8K
  mirror    11.9M  7.99G      0      0      0  41.4K
    c2d0s1      -      -      0      0      0  41.4K
    c3d0s1      -      -      0      0      0  41.4K
cache           -      -      -      -      -      -
  c2d0s3    46.8G     8M      0      0  16.8K  98.6K
  c3d0s3    46.8G     8M      0      0  16.8K  98.3K
----------  -----  -----  -----  -----  -----  -----
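To pull the cache usage out of output like this without reading it by eye, a small parser is enough. The sketch below is not part of any ZFS tooling: the function names are made up, it assumes the size suffixes are powers of 1024, and it only understands the column layout shown above.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a zpool-style size such as '46.8G' or '8M' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def cache_alloc(iostat_text):
    """Sum the 'alloc' column of the devices listed under the 'cache' heading."""
    total = 0.0
    in_cache = False
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cache":
            in_cache = True
            continue
        if in_cache:
            if fields[0].startswith("-"):
                break  # separator line ends the cache section
            total += parse_size(fields[1])
    return total


sample = """\
cache           -      -      -      -      -      -
  c2d0s3    46.8G     8M      0      0  16.8K  98.6K
  c3d0s3    46.8G     8M      0      0  16.8K  98.3K
----------  -----  -----  -----  -----  -----  -----
"""
print(f"{cache_alloc(sample) / 1024**3:.1f}G in L2ARC")  # prints "93.6G in L2ARC"
```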

Later, if you ever need to, you can remove these cache/slog devices, say if you wanted to use the ssd's elsewhere or needed the sata ports for other spinning disks. Cache can be removed on any version, and slog/zil can be removed as long as you're on zpool version 19 or above. Just like when adding devices, you have to be careful which command you use, as each has a different meaning.

zpool remove - removes a cache device or top level vdev (mirror of logs only). Mirror vdev name comes from the zpool status output.

zpool remove yawn c3d0s3
zpool remove yawn mirror-1

zpool detach - detaches a disk from a mirror vdev (can return it to a single device vdev if you remove the second last disk). The example below detaches c3d0s1 from a mirror; if it's not a mirror the command will error out. Detach works on any mirror vdev, not just logs.

zpool detach yawn c3d0s1

zpool attach - makes a mirror out of an existing vdev (single device, or mirror). The example below attaches new device c3d0s1 as a mirror of already present device c2d0s1. Attach also works on any vdev, not just logs.

zpool attach yawn c2d0s1 c3d0s1

After using ZFS for over 2 years now I've come to really appreciate it. There's a definite learning curve to it, but no more than the equivalent on Linux: Linux RAID + LVM + how to resize/reshape and so on. Actually zfs might be simpler, as all the commands are clearly documented in one place and all behave the same. The best part of ZFS is knowing your data is not rotting away on disk, and the very easy incremental replication, and snapshots, and Solaris's smoking fast NFS server, and and and


November geeky catch up

posted by robert
Nov 10

I noticed recently that a google apps account now is a real google account, and can log into most (nearly all) services that a normal google account can log into. So this kicked off the idea of migrating from gmail to google apps mail again. Originally I didn't because of features not present in the apps version, but now I look at it, nearly all the features I care about are across (new contacts tool isn't but I don't use it that often anyway).

Fortunately synchronising most things in and out of google is easy; Data Liberation made all the instructions easy to follow and located in one place. I also only use a few of google's applications anyway: reader, gmail, docs and more recently picasa web. All except mail were very easy to export/import between the accounts, but mail still proves to be a challenge.

This time around (circular reference) I noticed gmail-backup doesn't actually back up all of the email in the account (it misses sent items, and doesn't copy/apply stars). Its linux version needed Python 2.5; I have 2.6 and 3 (Fedora 13) and that wasn't good enough. Running it under wine backed up ok, but resulted in an unhandled exception part way through the restore, so it too was useless. As a last resort I ran it on windows and once it finished restoring, the message counts were wrong. What the hell? BackupGoo wasn't immediately clear that it could only back up and was trial-ware. Lame.

Then I went to consider doing it manually, thunderbirds away! Well that was destined for failure too. It had no way to mass import .eml files (from the backups) and drag-n-dropping them in didn't work (on linux – might have on windows). I found an add-in, ImportExportTools, which was actually quite useful: it exported the whole mailbox to a series of mbox files, which I could then import. Great, I was thinking, until I noticed again the message counts were wrong. It wasn't exporting all messages. What the hell?

So then I poked around looking for an imap copy application, and after looking at several, downloaded imapcopy which looked like it was going to be fine. Alas, another red herring, half way through the restore, it dumped an unhandled exception and a stack dump. So I grabbed the code (java) and fixed that bug, ran it again, found a few more issues, fixed them. After a few cycles of that, adding more debug code and logging to help track down where the messages were being duplicated, it's now complete. I'll have to split out the bug fixes from the hacky debug code and submit a patch.


Next up on the geeky bits. Ages ago I replaced my old WRT54GS router with an even older SnapGear router due to the WRT not handling the throughput of my adsl line (22mbit). Then around Christmas last year I bought a Ubiquiti Networks RouterStation Pro (board). Due to power supply related difficulties, lack of a case, and the need to build OpenWRT trunk (and me failing at that this time) it sat in an anti-static bag for about 10 months.

However one day after coming back from Timor I sat down and decided to get it going. I checked out a clean build tree of OpenWRT (not my ages old one from ye old days which was the cause of my failure) and built a very basic image, successfully. In cleaning up the house and chucking out stuff, I found a power supply which happened to have the same plug (and polarity) as the board needed, but only 12v. Everything I read said the board needs 24v or more for the mini-pci wireless cards to work right. So I checked the power supply's output, and it was about 15v, nice. Even better when the board booted up. Although it already comes with OpenWRT on the board, it is a pretty outdated trunk version which is why I started with building a new image. So I had a quick poke around the now old firmware, noted a few things, tar'd up the root image and flashed my new one to it.

Flashing it was, umm, tricky at first. The board didn't come with much in the way of documentation and it's sort of spread around the net rather than being located in a few obvious places. Fortunately again, the OpenWRT site's guide for this board was mostly spot on.

Quick guide for the lost/interested out there.

  1. Grab the image from the OpenWRT site; you'll want the one for the RouterStation PRO, either a snapshot or the latest RC (right now, Backfire 10.03.1-rc3). If you're flashing with tftp you'll want the factory image, and I recommend the squashfs version (unless you have specific needs and know what you're doing). So grab: openwrt-ar71xx-ubnt-rspro-squashfs-factory.bin

  2. Connect your computer to the WAN port (or the WAN port to your lan etc)

  3. Configure your network like this: 192.168.1.2/255.255.255.0 (anything but 192.168.1.20 and 192.168.1.1). On linux all I do is bring up an aliased network device:

     ifconfig eth0:0 192.168.1.2 netmask 255.255.255.0 up

     This will not affect your normal eth0 connection. YMMV if you have a firewall set up locally.

  4. For peace of mind, leave a ping window running, pinging 192.168.1.20 (use -t if on windows, so it keeps running).

  5. Here's the tricky bit. While holding the reset button, plug in the power to your RouterStation Pro and hold the reset button for at LEAST 10 seconds. You should see the third LED flash a bit and the second one is supposed to come on, but I don't think mine did.

  6. The pinging should succeed here.

  7. Send the firmware image with tftp in binary mode:

     tftp -m binary 192.168.1.20 -c put openwrt-ar71xx-ubnt-rspro-squashfs-factory.bin

     In theory, after flashing, the router will reboot and provide dhcp on the switch (the other 3 ports). You can now telnet to 192.168.1.1 as usual, as root with no password, on the LAN ports. SSH won't work until you set a password, and once a password is set, telnet won't start on boot anymore.
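If the ping in step 6 never answers, the first thing to rule out is the address picked in step 3. The check below is a minimal sketch using Python's ipaddress module; the two board addresses are the ones from the steps above, while the function name is made up for the example.

```python
import ipaddress

BOOTLOADER_IP = ipaddress.ip_address("192.168.1.20")  # board while waiting for tftp
OPENWRT_IP = ipaddress.ip_address("192.168.1.1")      # board after flashing


def usable_for_flashing(cidr):
    """True if the host address shares a subnet with the board and doesn't clash."""
    iface = ipaddress.ip_interface(cidr)
    if iface.ip in (BOOTLOADER_IP, OPENWRT_IP):
        return False
    return BOOTLOADER_IP in iface.network


print(usable_for_flashing("192.168.1.2/255.255.255.0"))  # True
print(usable_for_flashing("192.168.1.20/24"))            # False, clashes with the board
print(usable_for_flashing("10.0.0.2/24"))                # False, wrong subnet
```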

If you're like me, and slightly uncoordinated (or drunk), you can set tftp to retry and leave it running like the ping window as follows (linux only most likely):

echo -e "binary\nrexmt 1\ntimeout 60\ntrace\nput openwrt-ar71xx-ubnt-rspro-squashfs-factory.bin\n" | tftp 192.168.1.20

This way as soon as the board is ready for the image, it will be sent.
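For the curious, "binary" in the tftp client is the mode called "octet" on the wire. The sketch below builds the packets a client would send for a put, following the packet layouts in RFC 1350. It only constructs the bytes and does no networking, retries or ACK handling; the function names are made up, and it does not handle files needing more than 65535 blocks (about 32 MB).

```python
import struct

BLOCK_SIZE = 512  # fixed data block size in plain TFTP (RFC 1350)


def write_request(filename, mode="octet"):
    """WRQ packet: opcode 2, filename, NUL, transfer mode, NUL."""
    return (struct.pack("!H", 2) + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def data_packets(payload):
    """Yield DATA packets (opcode 3) with block numbers starting at 1.

    A final block shorter than 512 bytes (possibly empty) ends the transfer,
    so a payload that is an exact multiple of 512 gets a trailing empty block.
    """
    block = 1
    for offset in range(0, len(payload) + 1, BLOCK_SIZE):
        yield struct.pack("!HH", 3, block) + payload[offset:offset + BLOCK_SIZE]
        block += 1


print(write_request("firmware.bin"))
print(len(list(data_packets(b"x" * 1024))))  # prints 3: two full blocks plus an empty one
```

The rexmt and timeout settings in the one-liner above simply make the client keep resending that first write request until the board's bootloader is ready to answer it.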

Once you're on a half recent build of OpenWRT you can flash using the sysupgrade image either from the web gui or from command line, which saves all this mucking around with tftp. Not that I'm knocking tftp, I did build a client and server in C# as an exercise – and it performed quite well.

After some fairly basic setup I was able to swap my routers over and it's been great ever since. I now have wireless again, and now it's 802.11n. Yay.

I guess that'll do for now.


Timor and more

posted by robert
Sep 28

So I recently spent 10 days in Timor Leste, 4 nights camping out and about.

Let me preface this by saying I've never been to Indonesia at all (not even Bali) and I guess my only real exposure to the third world would be seeing aspects of it in South Africa over 10 years ago. Seeing from afar, not being part of it.

Overall people were friendly, well meaning and helpful. We didn't see any crime or even traffic accidents (despite the Indonesian style traffic flow). I was there for a 5 day bike race, across the country and back, so I got to see a lot of the countryside, villages and amazing views. During the race everywhere we went had streets lined with locals all cheering us on. My impression was there is a very strong sense of patriotism and unity for the country, despite there still being some rebel factions and its history being covered in bloodshed. It seems the vast majority are proud to call Timor Leste their home. Before going I wasn't aware of how important Timor Leste was during WW2, especially for us Australians. While there we went to the memorial at Dare (which had an amazing view of Dili from the hills), and on our way through Balibo we visited the house where the 5 journalists were killed in 1975.

Initially I guess I was hesitant as to why I was there in the first place, but once we were out on the road dodging cars and motorbikes it all came back to me. The 5 days of the race were awesome, although very tough at times and even though I've done a hard 5 day road ride before, this was something else. We climbed up to over 1900m elevation, we had to walk up rough sections with a grade of >28% and descend single track with gradients over 30%. Battling the heat and humidity on the coast and the wind and rain through the hills we raced on. I'll upload my photos to my gallery and put up some links to the gps data once I've sorted out the details. I also need to update mycyclinglog as it's a bit out of date again.

During my time there I even met President José Ramos-Horta and shook his hand. During his addresses to the riders he was humorous and very down to earth, and after watching Balibo I gained further insight into how strongly he cares for his country. From my limited and outside view he's an excellent choice for a leader. Let's not say anything about Australia's political situation, actually, no just one thing. We tried really hard to tie the election (bah at two party preferred system) and the AFL grand final did manage a tie. Movin right along.

After BikeSnobNYC linked to some minimalists and made fun of their purple tshirts, I read some of their blogs and sort of got inspired. I guess it's something I've been on the edge of for a while and all these various bloggers just say it like it is. So maybe it wasn't a delusion or bad idea, just something I hadn't thought through to a conclusion point. So as a start, I've started on a much needed spring clean and I've been far more ruthless than before with the stuff that's been here but never used in ages.

Quick storage update. I bought some OCZ vertex2 SSD's for use as mirrored ZIL and L2ARC cache on my Solaris based storage box. While mucking around with them I also played with an embedded Solaris build kit, which after some modifications actually works neatly for me. I'll probably switch from Eon to my own base image when I install the SSD's into the primary box. A non obvious benefit of this is that kernel power management works on base Solaris, but not on Eon (with my controller/drives/whatever). I've been rethinking my storage solution over the last 6 or so months with a goal of drastically reducing power consumption and bumping the performance dramatically too. Ultimately the SSD's will end up in a newer, smaller box with fewer disks, and the big kahuna box will not be on 24x7 anymore (and its mirror obviously won't be either). Until that happens though, the new Solaris image hopefully will save some power with kernel power management and the SSD's will improve both read and write performance.

Riding off the back of that is a new desktop idea too. Since I switched to Linux full time a few months back, the atom+nvidia ion system has performed well, but not quite well enough for my use. So I'm toying with a real upgrade to that and will consolidate my VM's onto this box, so it'll have lots of ram. In the mean time, I'll try an SSD in the atom and see if that improves it much/at all. Most of the time I think it's IO bound, except when you know it's not - video playback (h264 is 99% ok thanks to the ion chip but everything else is so close but so far). The rest of the time when you know it's CPU bound, there's nothing you can do because although it's a dual core with hyperthreading it's still not capable of out-of-order execution and is only 1.6ghz (clocked to 1.9ghz I think). Still, for general use, it's an awesome box, and uses around 45w from the wall.

Oh no, with this I might be over my quota for words this week. All I've ever wanted was an honest week's pay for an honest day's work.


Zfs experiment continued

posted by robert
Jan 18

So the zfs experiment continues. Upon the release of b129 I set off into the unknown on a voyage of dedupe. Which at first had the promise of lower disk usage, faster IO speeds and a warm fuzzy feeling deep down that you only get from awesome ideas becoming reality. ahem

Most sources say you need more ram, and that is true; what they don't say is how much ram for what size data set, which might be more useful to home users like me. My boxes have 2gb of ram each, and that is not enough for dedupe, nowhere near. Not if you have 6 TB of randomish data. I might retry when I get to 8gb ram but not before. You see, if it can't keep the whole of the dedupe table in ram ALL the time, any write to a dedupe enabled volume will result in reads for the rest of the table, or at least seeks. So what I saw was a gradual slowdown while writing to the volume. I was determined to let it finish, to see what savings I would make, and then scrap it due to performance, but after waiting 16 days for the copy, I cancelled it.

The only way I found to even see the contents/size of the dedupe table (DDT) is: zdb -DD which results in output like this:

DDT-sha256-zap-duplicate: 416471 entries, size 402 on disk, 160 in core
DDT-sha256-zap-unique: 47986855 entries, size 388 on disk, 170 in core

DDT histogram (aggregated over all DDTs):

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    45.8M   5.69T   5.66T   5.66T    45.8M   5.69T   5.66T   5.66T
     2     394K   43.0G   40.3G   40.3G     821K   89.0G   83.0G   83.1G
     4    9.90K    527M    397M    402M    47.0K   2.35G   1.76G   1.79G
     8    2.06K    125M   82.4M   83.4M    21.1K   1.20G    795M    806M
    16      391   13.7M   8.54M   8.76M    7.26K    272M    162M    166M
    32       69   1.17M    776K    822K    3.08K   51.3M   32.7M   34.8M
    64       17    522K    355K    368K    1.43K   36.9M   25.1M   26.2M
   128        6    130K      7K   11.2K    1.07K   31.3M   1.50M   2.23M
   256        2      1K      1K   2.48K      833    416K    416K   1.01M
   512        4      2K      2K   4.47K    2.88K   1.44M   1.44M   3.32M
    2K        1     512     512   1.24K    2.79K   1.39M   1.39M   3.46M
 Total    46.2M   5.73T   5.70T   5.70T    46.7M   5.78T   5.74T   5.74T

dedup = 1.01, compress = 1.01, copies = 1.00, dedup * compress / copies = 1.01
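The two DDT lines at the top of that output give a rough feel for the ram problem. The sketch below assumes the "in core" figure is bytes per entry (that reading is an assumption, not something zdb states here) and simply multiplies it out; the function name is made up.

```python
def ddt_core_bytes(tables):
    """Sum entries * in-core bytes per entry over all DDT tables."""
    return sum(entries * in_core for entries, in_core in tables)


# (entries, in-core size) taken from the zdb -DD output above
tables = [
    (416_471, 160),     # DDT-sha256-zap-duplicate
    (47_986_855, 170),  # DDT-sha256-zap-unique
]
total = ddt_core_bytes(tables)
print(f"{total / 1024**3:.2f} GiB")  # prints "7.66 GiB"
```

If that reading is right, the table alone wants several times the 2gb in these boxes, before counting anything else that needs to live in ram.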

Savings of around 80gb with dedupe and compression (backup box so no real world performance requirement) are just not worth the need for 3-n times the ram and possibly an ssd for the l2arc to speed things up. Yep, the suggestion and observed behaviour was to hook up a cheap small (30gb) SSD for cache to accelerate it. I don't mind that so much for a primary but this is my backup/2nd copy box so it's not really ideal. Certainly not for 80gb of savings, or at current prices around $5 of disk.
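The 80gb figure falls out of the Total row of the histogram: the logical size of everything referenced, minus what was actually allocated on disk. A minimal sketch of that arithmetic, assuming the T suffix means units of 1024G (the function name is made up):

```python
def savings(referenced_lsize_t, allocated_dsize_t):
    """Return (space saved in G, combined dedup * compress ratio)."""
    saved_g = (referenced_lsize_t - allocated_dsize_t) * 1024
    ratio = referenced_lsize_t / allocated_dsize_t
    return saved_g, ratio


# Total row of the DDT histogram: referenced LSIZE 5.78T, allocated DSIZE 5.70T
saved_g, ratio = savings(5.78, 5.70)
print(f"saved ~{saved_g:.0f}G, ratio {ratio:.2f}")  # prints "saved ~82G, ratio 1.01"
```

The ratio matches the 1.01 that zdb reports on its last line.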

My second attempt is now underway. This time I've sliced up my data sets into more volumes, which means a smaller average size, around 2TB max per volume, which from experience at work I've learned is a good rule of thumb. So now I can enable compress+dedupe on only specific bits, hopefully where the most savings are to be made, and then the rest is just stored raw. This way the savings might be similar, but without the major write speed penalty. I've also realised for the production box if I want screaming performance, I'll throw an ssd on there, but that means more sata ports, which means a major change. I also need to work on power management too.

One thing that has gone right this time is I'm now using CF->IDE adaptors and booting off that. This way the OS thinks it's on a 2gb hdd, so booting doesn't have the complexity of usb boot and also uses less power and doesn't take up a sata port. Of course new boards don't have pata anymore so I might need to get a CF->sata one in future.

Another thing that must be said, Solaris's CIFS server is fast.