random complexity

with a side of crazy

Posts Tagged 'project'

Scripted ESX installations

Recently I decided to script the build of my ESX hosts at home, which would let me rebuild them more easily if the need arose. A side effect is that you get identical configurations easily without resorting to host profiles. After doing this I realised it would be useful at work, where I'm about to build 9 nearly identical clusters. This post is more of a brain dump of the whole process, for my own reference and possibly of use to others. I'll focus on the enhancements for use at work.

The high level overview is this: PXE boot the ESX installer with a parameter pointing to the kickstart file. That parameter points to a web server script which produces a host-specific kickstart file. The enhancement for work is that the first kickstart file is simple, but it obtains the host's Dell Service Tag and passes it to the web server to produce the rest of the kickstart file.

PXE boot the installer. I installed a minimal installation of Fedora 20 in a VM and added dnsmasq, syslinux-tftpboot, nginx and php-fpm. This VM has 2 networks: one (ens192) is connected to the local LAN, and the second (ens224) is the network builds will occur on. Syslinux is on there as it drops the pxelinux binaries and builds out a /tftpboot folder. Nginx is the web server, and php-fpm is installed because I knocked up a quick script for templating the kickstart files in PHP (don't hate me). Dnsmasq is a simple DHCP server which can do DNS and TFTP too, so it's a no-brainer for this deployment. I used a very simple dnsmasq configuration, which I put in the fragment directory. It could probably be simpler, but I just based it off a working one from my OpenWrt box (which already had PXE working for Linux installers). After creating the config file, enable and start the service.

File: /etc/dnsmasq.d/buildbox.conf


The pxelinux configuration is very similar to syslinux, which is great because ESX uses syslinux for installing and for booting installed machines. An easy mistake to make when setting this up: "pxelinux.cfg" is a directory, NOT a file. Again I set up a basic configuration based off one I already used, so I know it works. The file below sets up a simple interactive menu showing the available options: local HDD, ESX installer, ESX kickstart. Note: this shows esx5.1 (for work) but it works fine with 5.5 (home). Also see the tree structure of the basic TFTP boot area - note pxelinux.cfg is a folder, the other files (supplied by syslinux) are in the tftproot, and esx is in a sub-folder.

TFTP directory tree

File: /tftpboot/pxelinux.cfg/default

default menu.c32
prompt 0
timeout 3000
ontimeout local


label local
    menu label Local HDD
    kernel chain.c32
    append hd0 0

label esx51
    menu label ESXi-5.1 Installer
    kernel esx51/mboot.c32
    append -c esx51/boot.cfg

label esx51-ks
    menu label ESXi-5.1 Kickstart
    kernel esx51/mboot.c32
    append -c esx51/boot.cfg ks=

The ESX files need to be loaded onto the tftp server too. I keep them in a subdirectory off the tftproot to make it easy to add/change later. Simply copy all files off your esx iso into a location like /tftpboot/esx51 as I used. Then edit the /tftpboot/esx51/boot.cfg file to cater for the changed root dir. The lines that need editing are "kernel=/esx51/tboot.b00" and "modules=". Every file reference needs the path included so add /esx51/ to each file.
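The boot.cfg edit described above can also be scripted. This is a minimal Python sketch, not what I actually ran: it assumes boot.cfg has one `kernel=` line and one `modules=` line with entries separated by ` --- `, and the file names in the sample other than tboot.b00 are only illustrative.

```python
def prefix_boot_cfg(text, prefix="/esx51/"):
    """Add the sub-directory prefix to every file on the kernel= and modules= lines."""
    def add_prefix(name):
        # Leave names alone if they already carry the prefix, so a re-run is harmless
        if name.startswith(prefix):
            return name
        return prefix + name.lstrip("/")

    out = []
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "kernel":
            line = "kernel=" + add_prefix(value.strip())
        elif sep and key == "modules":
            # Module entries are separated by " --- "
            names = [n.strip() for n in value.split("---")]
            line = "modules=" + " --- ".join(add_prefix(n) for n in names if n)
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative boot.cfg fragment (hypothetical module list)
sample = (
    "title=Loading ESXi installer\n"
    "kernel=/tboot.b00\n"
    "kernelopt=runweasel\n"
    "modules=/b.b00 --- /useropts.gz --- /k.b00\n"
)
print(prefix_boot_cfg(sample))
```

Lines such as kernelopt= are left untouched, since only the kernel and module file references need the new path.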

That will get you a PXE booting ESX installer, but to make it more useful for kickstarting let's do the rest.

Nginx setup. Set up nginx to use PHP via php-fpm. This is a very simple config file with any security options removed. On an isolated network it's probably fine, but I wouldn't leave this running on any real web server. Also note the web root is /www, which is where we'll be putting files. Set up PHP by setting the timezone (this avoids an error);

sed -i -e "s/;date.timezone =/date.timezone = Etc\/UTC/" /etc/php.ini

Now enable and start the php-fpm service. Once the nginx file is saved, enable and start nginx.

File: /etc/nginx/conf.d/default.conf

server {
        listen       80 default_server;
        server_name buildbox.local;

        root /www/;
        index index.html index.htm index.php;

        location / { }

        location ~ \.php$ {
                include /etc/nginx/fastcgi_params;
                fastcgi_index index.php;
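                fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
                # assumed: php-fpm's default listen address, adjust to match your pool
                fastcgi_pass 127.0.0.1:9000;
        }
}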

I'd recommend testing the web server from another host to make sure PHP works. Drop a file like this on there to test. If the script produces the phpinfo page then it's working; if not, hit the logs and see why not.

File: /www/test.php

<?php phpinfo(); ?>

Now for the actual kickstart part. I found plenty of good resources for this online, so it wasn't hard to get a working config going pretty quickly, allowing me to focus my efforts on specific requirements. For home I use a templated kickstart file: a number passed to the PHP script selects one host or another, which means my pxelinux menu has an entry for each host as the URL is slightly different. For work I wanted to be more efficient than this, because with the much larger number of hosts I didn't want heaps of menu options. Fortunately I was able to get the Dell Service Tag (ultimately a short serial number) off the server prior to ESX installation. We track the assets using this number, so it's helpful to know service tag 1234XYZ belongs to company ABC and is destined for location JKL or whatever.

The work flow is this:

  • PXELinux menu includes a URL to a simple file /ks.txt. That file is the kickstart file.
  • ESX installer boots (over tftp) and downloads the ks.txt file (over http).
  • The kickstart file (ks.txt) includes a pre-install script to determine the service tag and pull down the rest of the configuration over http.
  • The web server returns a service tag specific kickstart file for a supplied service tag.
  • ESX installer uses the now complete kickstart file to complete the installation
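The web server step of this workflow can be sketched roughly as follows. My real script is PHP; this is a Python stand-in for illustration only. The CSV column names, the example IP addresses and the two-function layout are all assumptions, and only the esxcli command shapes come from my actual kickstart output.

```python
import csv
import io
from string import Template

# Hypothetical host inventory; column names and addresses are made up
HOSTS_CSV = """service_tag,hostname,mgmt_ip,mgmt_netmask,gateway
1234XYZ,auxxxesx1,192.0.2.11,255.255.255.0,192.0.2.1
"""

# A cut-down per-host template; the real one carries every setting
KS_TEMPLATE = Template(
    "%firstboot --interpreter=busybox\n"
    "esxcli network ip interface ipv4 set -i vmk0 -t static -I $mgmt_ip -N $mgmt_netmask\n"
    "esxcli network ip route ipv4 add -n 'default' -g $gateway\n"
    "esxcli system hostname set --fqdn=$hostname.internal\n"
)


def render_kickstart(service_tag, csv_text=HOSTS_CSV):
    """Return the host-specific kickstart text, or None for an unknown tag."""
    tag = service_tag.strip().upper()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["service_tag"].upper() == tag:
            return KS_TEMPLATE.substitute(row)
    return None


print(render_kickstart("1234xyz"))
```

Returning nothing for an unknown tag is deliberate in this sketch: a host that isn't in the inventory should fail the install rather than get someone else's settings.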

The idea was to use "esxcli hardware platform get" to get the service tag and supply that to the php script. In the outputs below the Dell Service Tag is the serial number line.

# esxcli hardware platform get
Platform Information
   UUID: 0x4c 0x4c 0x45 0x44 0x0 0x4b 0x31 0x10 0x80 0x44 0xb1 0xc0 0x4f 0x43 0x32 0x53
   Product Name: PowerEdge R710
   Vendor Name: Dell Inc.
   Serial Number: 1K1DC2S
   IPMI Supported: true
# esxcli hardware platform get | grep Serial | cut -f2 -d:

Base kickstart file (rev 0):

rootpw TempESXPassword!!

clearpart --firstdisk --overwritevmfs
install --firstdisk --overwritevmfs

#DHCP for installation
network --bootproto=dhcp --addvmportgroup=false --device=vmnic20
#vmnic 20 is pci1-1 (on dell R820)

%include /tmp/fullks.txt

%pre  --interpreter=busybox
#grab the per host config
ST=$(esxcli hardware platform get | grep Serial | cut -f2 -d:)
wget -O /tmp/fullks.txt "${ST}"

I won't go into details about the ks-conf.php script. Basically it takes the service tag in and pulls details out of a CSV to produce that host's complete configuration (all settings, IPs and vswitches). As usually happens to me, this was too easy and was bound to have issues. Once I'd eliminated the obvious ones I checked the nginx access log and found the service tag was coming through blank. Fortunately you can still get a shell during the ESX installer, where I quickly learned that esxcli doesn't work and dmidecode isn't present, BUT the older tools still work, so I had to adjust to use esxcfg-info instead. After a bit of hunting I found the info I needed, and using the few tools available in the installer environment ended up with this;

Base kickstart file (rev 1, changed line only):

ST=$(esxcfg-info | grep "Serial Number" | head -1 | tail -c 8)
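To make it clearer what that pipeline does, here is the same logic as a small Python sketch. The sample text is a made-up stand-in for esxcfg-info output (the real layout may differ); the point is only that `tail -c 8` keeps the last 8 bytes of the first matching line, which is a 7 character service tag plus the newline.

```python
def service_tag_from_esxcfg_info(output):
    """Mimic: grep "Serial Number" | head -1 | tail -c 8"""
    for line in output.splitlines():
        if "Serial Number" in line:          # grep + head -1: first match only
            last_eight = (line + "\n")[-8:]  # tail -c 8 counts the newline too
            return last_eight.rstrip("\n")   # $(...) strips trailing newlines
    return ""


# Hypothetical excerpt, not captured from a real host
SAMPLE = (
    "   |----Product Name.....PowerEdge R710\n"
    "   |----Serial Number.....1K1DC2S\n"
    "   |----Serial Number.....OTHER99\n"
)
print(service_tag_from_esxcfg_info(SAMPLE))
```

This relies on the tag always being 7 characters. A shorter serial number would drag in whatever characters precede it on the line, so it suits Dell service tags but isn't a general solution.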

That worked, and now I was cooking with gas. A few other things to note. The base kickstart file's network line is, in my case, for installation only - DHCP on the 192.168.0.x network. This network is still present for the %post script, so I was able to download additional packages from the web server for installation later. In the %firstboot section I set up vmk0's target IP for the destination network, and all the other settings necessary. Below is a sample of the resulting templated script out of ks-conf.php. I've replaced all possibly sensitive details and reduced the config to only show the basic configuration (all other vswitches are based off the same template as vSwitch2, only with different vmnics and vlans). In the interest of readability I've left all my comments in this.

Sample output from ks-conf.php?st=xxxx123

%post  --interpreter=busybox
#Dell openmanage vib
wget -P /vmfs/volumes/datastore1/

%firstboot --interpreter=busybox
# rename local datastore to something more meaningful
vim-cmd hostsvc/datastore/rename datastore1 "auxxxesx1_local"

# network settings
esxcli network ip interface ipv4 set -i vmk0 -t static -I -N
esxcli network ip route ipv4 add -n 'default' -g

# Set DNS and hostname
esxcli system hostname set --fqdn=auxxxesx1.internal
esxcli network ip dns search add --domain.internal
esxcli network ip dns server add --server=
esxcli network ip dns server add --server=
esxcli network ip dns search remove --domain=local #from dhcp
esxcli network ip dns server remove --server= #from dhcp

# Enable SSH and the ESXi Shell
vim-cmd hostsvc/enable_ssh
vim-cmd hostsvc/start_ssh
vim-cmd hostsvc/enable_esx_shell
vim-cmd hostsvc/start_esx_shell

# ESXi Shell availability timeout, the interactive idle time logout, and suppress the shell enabled warnings
esxcli system settings advanced set -o /UserVars/ESXiShellTimeOut -i 3600 #timeout also disables ssh after the timeout
esxcli system settings advanced set -o /UserVars/ESXiShellInteractiveTimeOut -i 3600
esxcli system settings advanced set -o /UserVars/SuppressShellWarning -i 1

cat > /etc/ntp.conf << __NTP_CONFIG__
restrict default kod nomodify notrap noquery nopeer
server ntp.internal
__NTP_CONFIG__
/sbin/chkconfig ntpd on

# Logging
esxcli system syslog config set --logdir /vmfs/volumes/auxxxesx1_local/logs --logdir-unique=true

#disable ipv6
esxcli system module parameters set -m tcpip3 -p ipv6=0
#module is renamed to tcpip4 in esx5.5
#esxcli system module parameters set -m tcpip4 -p ipv6=0

#mgmt network switch vSwitch0
esxcli network vswitch standard uplink add -v vSwitch0 -u vmnic4
esxcli network vswitch standard uplink add -v vSwitch0 -u vmnic20
esxcli network vswitch standard policy failover set -v vSwitch0 -a vmnic4,vmnic20
esxcli network vswitch standard policy failover set -v vSwitch0 --failback yes --failure-detection link --load-balancing portid --notify-switches yes
esxcli network vswitch standard policy security set -v vSwitch0 --allow-forged-transmits yes --allow-mac-change yes --allow-promiscuous no
esxcli network vswitch standard set --cdp-status both --vswitch-name vSwitch0 
#vmk0 is automatically on vSwitch0 

#NFS network switch vSwitch1
esxcli network vswitch standard add -v vSwitch1
esxcli network vswitch standard uplink add -v vSwitch1 -u vmnic0
esxcli network vswitch standard uplink add -v vSwitch1 -u vmnic2
esxcli network vswitch standard policy failover set -v vSwitch1 -a vmnic0,vmnic2
esxcli network vswitch standard policy failover set -v vSwitch1 --failback yes --failure-detection link --load-balancing portid --notify-switches yes
esxcli network vswitch standard policy security set -v vSwitch1 --allow-forged-transmits yes --allow-mac-change yes --allow-promiscuous no
esxcli network vswitch standard set --cdp-status both --vswitch-name vSwitch1 
esxcli network vswitch standard set -v vSwitch1 --mtu 9000

esxcli network vswitch standard portgroup add -v vSwitch1 -p NFS
esxcli network ip interface add -p NFS -i vmk1
esxcli network ip interface ipv4 set -i vmk1 -t static -I -N
#enable vmotion on vmk1
vim-cmd hostsvc/vmotion/vnic_set vmk1

#VMNET1 vSwitch2
esxcli network vswitch standard add -v vSwitch2
esxcli network vswitch standard uplink add -v vSwitch2 -u vmnic13
esxcli network vswitch standard uplink add -v vSwitch2 -u vmnic19
esxcli network vswitch standard policy failover set -v vSwitch2 -a vmnic13,vmnic19
esxcli network vswitch standard policy failover set -v vSwitch2 --failback yes --failure-detection link --load-balancing portid --notify-switches yes
esxcli network vswitch standard policy security set -v vSwitch2 --allow-forged-transmits yes --allow-mac-change yes --allow-promiscuous no
esxcli network vswitch standard set --cdp-status both --vswitch-name vSwitch2
esxcli network vswitch standard portgroup add -v vSwitch2 -p VMNET1
esxcli network vswitch standard portgroup set -p VMNET1 -v 100

esxcli system snmp set --communities=esxcommunity --syscontact="xxxxxx" --syslocation="xxxxxx"
esxcli system snmp set --targets=xxxxxx.internal@162/esxcommunity
#esxcli system snmp set --enable true
#allow all hosts
esxcli network firewall ruleset set --ruleset-id snmp --allowed-all true
/etc/init.d/snmpd restart

#nfs datastores
esxcli storage nfs add --host --share /xxx_vol1 --volume-name xxx_VOL1
esxcli storage nfs add --host --share /xxx_vol2 --volume-name xxx_VOL2

#Dell Openmanage vib
esxcli software vib install --depot=/vmfs/volumes/auxxxesx1_local/OM-SrvAdmin-Dell-Web-7.4.0-1070.VIB-ESX51i.zip

#backup and go into maintenance mode
esxcli system maintenanceMode set -e true

Things to note:

  • The DNS search domain and server from DHCP are removed in %firstboot
  • I'm pxebooting on vmnic20 which is to be a management interface.
  • NFS and vmotion are on the same vswitch, in my case this is because that vswitch is 10Gbit.
  • My hosts all have additional vswitches for different networks (physical lan separation), I've only shown one of them as VMNET1 above.
  • SNMP hasn't been enabled, as my hosts are being installed then shipped, not installed in place.
  • For the installer to add the NFS datastores, they have to be available at the time of installation.
  • I haven't assigned licenses at this stage. This could be done easily however I prefer to add them when joining the host to vSphere.
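Since every additional vswitch follows the same pattern as vSwitch2, the templating can be pictured as a function like the one below. This is a Python sketch rather than my PHP template, and the function name and parameters are invented for illustration; the esxcli lines it emits are copied from the sample output above.

```python
def vswitch_commands(name, uplinks, portgroup, vlan=None, mtu=None):
    """Build the esxcli lines for one standard vswitch, in the order used above."""
    base = "esxcli network vswitch standard"
    cmds = [f"{base} add -v {name}"]
    cmds += [f"{base} uplink add -v {name} -u {nic}" for nic in uplinks]
    cmds.append(f"{base} policy failover set -v {name} -a {','.join(uplinks)}")
    cmds.append(
        f"{base} policy failover set -v {name} --failback yes "
        "--failure-detection link --load-balancing portid --notify-switches yes"
    )
    cmds.append(
        f"{base} policy security set -v {name} --allow-forged-transmits yes "
        "--allow-mac-change yes --allow-promiscuous no"
    )
    cmds.append(f"{base} set --cdp-status both --vswitch-name {name}")
    if mtu is not None:
        cmds.append(f"{base} set -v {name} --mtu {mtu}")
    cmds.append(f"{base} portgroup add -v {name} -p {portgroup}")
    if vlan is not None:
        cmds.append(f"{base} portgroup set -p {portgroup} -v {vlan}")
    return cmds


for line in vswitch_commands("vSwitch2", ["vmnic13", "vmnic19"], "VMNET1", vlan=100):
    print(line)
```

vSwitch0 is the exception, as it already exists after installation and so doesn't need the initial add.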

Oh and as per usual, the buildbox VM was also built with kickstart which preconfigures everything as above, and dumps the scripts and templates down - just in case I need to rebuild that too.

That'll do for now. I've got another vmware related post coming soon.

FreeBSD 9.1 virtual wan simulator

For a recent experiment I needed a wan simulator, so I decided to build one. Using FreeBSD makes this very easy, however many of the sites I went to had partial, incomplete, or no longer working examples.

Specifically what I wanted was a VMware virtual machine capable of Layer 2 wan simulation. Layer 2 is easier because there's no routing or subnet changes to mess around with, just a single subnet with some hosts further away than others. It had to be a VM as my whole lab environment is virtual, so it makes sense to just put a VM between 2 vswitches and get the restricted bandwidth, increased latency and packet loss.

Now I present to you the FreeBSD 9.1 based VMware hosted layer 2 wan simulator.

First off, the VMware bits.

  • Create a new vswitch for the "far" side of the link. For convenience I keep my LAN on the near side, so only the machines you want remote are at the far end.

Add Networking

  • On the host configuration tab, Networking page, click on Add Networking. The defaults should be ok.

Add Networking

  • Connection type: virtual machine

Add Networking

  • Create a new standard vswitch. Give it whatever label you want, I called mine "FAR" so it's clear what it is. This vswitch has no physical adapters connected. If you have spare ethernet ports (or vlan capable switch) you could connect the far side network to a physical device.

  • Next, on both of your vswitches, enable Promiscuous mode. This is so this VM can pass traffic between the switches which isn't heading to or coming from the machine's own MAC address.

VSwitch Properties

  • On the host configuration tab, Networking page, click Properties on each vswitch

VSwitch Properties

  • Then click edit if Promiscuous mode isn't enabled.

VSwitch Properties

  • and Enable it.
  • Now create a new VM with 2 nics, both Intel e1000. I gave it 256MB ram and a 2GB disk.
  • Guest OS type "other" and "FreeBSD 64bit"
  • It doesn't need this much disk, however I'm not sure how small before the installer will complain. Once installed it uses about 650MB of disk for root.

Now the FreeBSD bits.

  • Perform a base FreeBSD install. I used the FreeBSD-9.1-RELEASE-amd64-dvd1.iso disk image.
  • Boot up the machine and the installer should boot to a prompt, select Install.
  • For "Distribution select" I unchecked everything as a wansim doesn't need anything.
  • Guided Partitioning, use entire disk, then tab across to Finish and accept and commit the changes.

FreeBSD Partition Editor

  • Set your root password, and for now just configure one of the network interfaces as DHCP.
  • At the system configuration screen, be sure to leave "sshd" enabled.
  • I disabled crash dumps and did not add any users to the system. We'll enable ssh as root for adjusting settings.
  • Exit the installer and go into the manual configuration shell to make additional changes.

  • Use vi to edit /etc/ssh/sshd_config and change the following line from:

    #PermitRootLogin no

    to:

    PermitRootLogin yes
  • This allows you to ssh in as root. Now exit the shell and let the system reboot.
  • Once it reboots, ssh in as root (you might want to console in as root to run ifconfig to get your ip, or tail your dhcp server logs).
  • Now make the following additions (you could have done this before, but it'll be easier to copy/paste via ssh than the vmware console.)
  • Edit your /etc/rc.conf to change all the ifconfig lines to the following:
    ifconfig_bridge0="addm em0 addm em1 up"
    #dhcp ip
    #static ip
    #ifconfig_em0="inet 10.x.x.x netmask"
  • And at the bottom of /etc/rc.conf add the following:
    #wan sim shaping
    ipfw -f flush
    ipfw -q add pipe 1 ip from any to any
    ipfw pipe 1 config bw 1024Kbit/s delay 100
  • To your /boot/loader.conf add the following (file might not exist to start with.)
  • To your /etc/sysctl.conf add the following.
  • Reboot the vm, and now the 2 vswitches should be connected with a 1Mbit link and 100ms delay (200ms round trip time).
  • From the wansim vm, ping your gateway and it should respond with an RTT of double the delay value set in rc.conf

For a more complicated and possibly real world setup, say asymmetric bandwidth settings, try this instead in the rc.conf. This assumes em0 is connected to the LAN network, and em1 is connected to the FAR network. This applies the shaping and delay on outbound traffic allowing each direction to be controlled individually.

    ipfw -f flush
    #FAR to LAN is pipe 1
    ipfw -q add pipe 1 ip from any to any out via em0
    ipfw pipe 1 config bw 10240Kbit/s delay 100
    #LAN to FAR is pipe 2
    ipfw -q add pipe 2 ip from any to any out via em1
    ipfw pipe 2 config bw 1024Kbit/s delay 100
    #and to avoid locking ourselves out
    ipfw add 65534 allow ip from any to any
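If you want to try several link profiles, the rule set above is regular enough to generate. The Python sketch below only builds the same ipfw lines as text; the function names and parameters are my own invention for illustration, and nothing here talks to ipfw itself.

```python
def pipe_rules(pipe, out_iface, bw_kbit, delay_ms):
    """Rules for one direction: match traffic leaving out_iface, then shape it."""
    return [
        f"ipfw -q add pipe {pipe} ip from any to any out via {out_iface}",
        f"ipfw pipe {pipe} config bw {bw_kbit}Kbit/s delay {delay_ms}",
    ]


def wansim_rules(lan_iface, far_iface, far_to_lan_kbit, lan_to_far_kbit, delay_ms):
    """Full asymmetric rule set in the same order as the example above."""
    rules = ["ipfw -f flush"]
    # FAR to LAN traffic leaves via the LAN interface (pipe 1)
    rules += pipe_rules(1, lan_iface, far_to_lan_kbit, delay_ms)
    # LAN to FAR traffic leaves via the FAR interface (pipe 2)
    rules += pipe_rules(2, far_iface, lan_to_far_kbit, delay_ms)
    # Catch-all so we don't lock ourselves out
    rules.append("ipfw add 65534 allow ip from any to any")
    return rules


for rule in wansim_rules("em0", "em1", 10240, 1024, 100):
    print(rule)
```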

Now if you're lazy like me and want to make this exact setup quicker, at the installer shell stage you can download and run a little script to do all of this for you. This assumes you selected DHCP for em0 in the installer, otherwise that bit won't work.

    fetch http://tuph.net/wansim/wansim_setup.sh
    sh wansim_setup.sh

Going even further, you might want some packet loss on your links, say to simulate a C Band satellite connection. To do that, change the pipe configuration to include "plr 0.05". The number is the fraction of packets to lose, as a value between 0 and 1: 0 means no packets are dropped, 1 means all packets are dropped, so 0.05 causes about 5% of packets to be dropped. Satellite latency is usually between 400 and 700ms RTT, so a delay of 300 in each direction would be fair (note that multihop satellite links would be more again). For asymmetric links, add it to both pipe configurations, otherwise you'll only get loss in one direction.

    ipfw pipe 1 config bw 2048Kbit/s delay 300 plr 0.05
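As a rough worked example of what those numbers mean for something like ping, here is a small Python calculation. It assumes the same delay and loss are applied in each direction, that losses are independent, and it ignores serialisation time at the configured bandwidth, so treat it as a back-of-envelope estimate only.

```python
def round_trip_estimate(delay_ms_each_way, plr_each_way):
    """Estimate RTT and the chance a request/reply pair is lost."""
    rtt_ms = 2 * delay_ms_each_way
    # Both the request and the reply have to survive their own pipe
    delivered = (1 - plr_each_way) ** 2
    return rtt_ms, 1 - delivered


rtt, loss = round_trip_estimate(300, 0.05)
print(f"RTT about {rtt} ms, round trip loss about {loss:.2%}")
```

So 5% loss each way shows up as a bit under 10% of pings going unanswered, which is worth keeping in mind when you compare the simulator against figures quoted for a real link.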

It's as easy as that.

Changes and getting rid of stuff again

The events of the past two weeks have made me more seriously consider what life is all about and what it means. The bottom line is life is precious and life is short. You don't know what you've got 'till it's gone. Secrets eat you from the inside and isolation hurts.

It turns out I've been wasting my life and I need to turn this around. Trying to stick to higher principles and beliefs has isolated me and led me down a river of lies and deceit. (No I'm not a member of a secret religious cult.)

On personal projects I've been trying to do things the right way for the right reasons. I was brought up to always do your best and always be prepared. It turns out just getting things done is better. After all if you never finish anything who cares how well you did bits of it or what you had planned. Results speak for themselves.

You could say it's time to come clean and air out the old laundry. However the minimalists would suggest just throwing it out and setting yourself free. Not to quote Fight Club deliberately, but "It's only after we've lost everything that we're free to do anything" kind of applies really well.

It's time for change. The road will be bumpy and in places there won't even be a trail. They say nothing worth doing is ever easy. Believe me, it's not easy.

That's a fair bit of beating around the bush. Basically it's that time again, when I need to get rid of some more of my stuff that's just taking up space and not being truly appreciated.

Just a quick list of things I need to rid myself of and you (or someone you know) might want, if I don't get any interest on here I'll put them up on gumtree. Prices are all flexible within reason;

  • AMD(ATI) Radeon HD6950 2GB graphics cards (2x currently in crossfire mode, roughly 12 months old) - $250 for both
  • Gigabyte GA-890FXA-UD5 AM3 mainboard with AMD Phenom II X6 1090T Black Edition six-core 3.2Ghz CPU and 8GB DDR3 ram. - $150 combined
  • HTPC microatx pc cases (2x) Antec NSK2400 with 380W psu (one has cardboard box too) - $40 each
  • Harry Potter books (1-4 paperback, 5-7 hardcover) - Free
  • Alex Rider books (8 all paperback) - Free
  • JVC home theatre amp (model RX-7032VSL) - $200
  • A Slim PS2 (chipped), controllers, component cables, bunch of games etc - $offers
  • A Gamecube (smash bros, mario kart, 3 controllers, component cable) - $offers
  • Logitech PS3 guitars (2x) these are the full size wooden ones. - $120 for both
  • A bodyboard and carry bag (for an adult sized person) - $40
  • Lego technic (several large kits, some mid size and a few miscellaneous) List at brickset.com - $offers

There's also a heap more computer gear I need to get rid of in the coming months. It's all old though, so if anyone's after anything specific let me know and you're welcome to what's not in use.

Late to Special Ed Class

File server rebuild

So the long planned rebuild of my home storage box had been through a few different revisions over the months. Originally I was pretty set on the backblaze case and design, aiming more for bulk storage than fast. I even went as far as procuring the 9 port multipliers for this, however I never found a cheap enough way to get the bare chassis. When I compared what the complete backblaze option would cost, it made little sense for anything other than sheer bulk storage.

Next up I was considering getting an entry level NetApp. As I'm in the industry the pricing I could get was looking pretty good - however it was still going to be a tough sell. Limited expandability due to high cost of disk shelves vs the up front system cost. When you factor in the up front cost including ALL disks over 3 years it started to look not so terrible. But it all fell apart when I thought about how to back it up (my current system is backed up zfs send/recv style) which NetApp can do (SnapMirror) however I couldn't simply splurge for a 2nd one. Also with a pretty firm price floor of about $6000 and not having any option on expansion I just couldn't justify it - too many dollars for too little storage. Sure it'd be fast, easy to manage, reliable and quality, for 2 times the price of a DIY box but unable to back it up. Backups are important, very important.

So after that I decided to go down a road quite a few others have too. Off the shelf 4RU 24bay case and cheap LSI SAS HBA's. The only downside was the limitation of 24 drives. To go beyond that I'd either need to build a complete second box, or buy the disk shelf version and additional HBA's. For now that's a future me problem. Fortunately the case was cheap compared to a backblaze one, and included the hotswap backplanes. In a few ways this is the same as the expansion problem of the NetApp (buy a second or high priced shelves) but at least backing it up is possible (current backup box still works) and because I don't need to buy disks it costs only a third of the NetApp.

Regardless of the overall system design, I'd already selected a mainboard & cpu combination. An Intel CPU obviously, and a server class board to ensure enough sufficiently wide PCIe slots.

So the overall parts list is this (and rough price);

  • Intel Xeon E3 CPU (I got the E3-1230v2 3.3ghz) ($270)
  • Intel S1200BTL server board ($240)
  • 16GB ECC Registered Unbuffered memory (2x8GB, so I can upgrade to 32GB later) ($240)
  • Norco 4224 chassis ($460)
  • Norco replacement fan wall for 120mm fans ($10)
  • 3x 120mm fans (quiet but high flow) ($12 each, $36)
  • 3x Intel SASUC8I HBA's. (LSI 1068 based, each have 2x SFF8087 connectors, so support 8 drives each) ($163 each (inc freight), $489)
  • 6x SFF8087-SFF8087 SAS cables ($60)
  • Corsair HX1000 PSU (modular PSU) (Already had, the newer model is $250)
  • Terminal blocks, molex wiring looms, velcro etc ($30)
  • Subtotal : $2085 (without drives)
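For anyone checking the sums, the list above only reaches $2085 if the power supply is counted at the $250 price of the newer model, even though I already had mine. A quick Python tally using the prices as listed:

```python
# Prices in dollars, straight from the parts list above
parts = {
    "Xeon E3-1230v2": 270,
    "S1200BTL board": 240,
    "16GB ECC memory": 240,
    "Norco 4224 chassis": 460,
    "120mm fan wall": 10,
    "3x 120mm fans": 3 * 12,
    "3x SASUC8I HBA": 3 * 163,
    "6x SFF8087 cables": 60,
    "Terminal blocks, looms, velcro": 30,
}
psu_newer_model = 250  # already owned, listed at the newer model's price

bought = sum(parts.values())
print("Actually purchased:", bought)
print("Including PSU at list price:", bought + psu_newer_model)
```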

I started off by fitting the replacement fan wall and fans. Then I marked and drilled the sides of the rear area to support 3x 2.5" SSD's mounted internally, as I was intending to boot from a USB drive and use rear mounted SSD's for cache.

Next up I worked out how I was going to wire the hot swap backplanes' power. Each of these horizontal blades supports 4 drives and has 2x standard 4pin molex power connectors. As my power supply has two separate 12V rails I wanted to balance these rails as evenly as possible, so I wired up the blades to alternating supplies - even numbered to one, and odd numbered to the other. This resulted in me using some terminal blocks and wiring off to the modular power connectors directly.

Here's the result.

Power cabling

Power rails

After this I fitted and tested the power supply and fans. Then the mainboard, CPU and RAM were installed, and it was ready for some initial testing. Then came the SAS HBA's and further testing.

Once the SAS HBA's were installed I was able to map out which slot mapped to which device. Fortunately this wasn't hard to figure out and somehow I had 11 spare disks to assist with this mapping out. On this mainboard, the first 3 PCIe slots were direct to the CPU, and the rest were off the chipset, so I used the top 3 slots (which I number 1-3 from CPU working away, see pic further down). What I ended up with was;

HBA to device mapping

I also took the time to reflash the cards from integrated RAID (IR) mode to initiator target (IT) mode as in this mode it's not so fast to kick out a possibly failing disk (which allows the use of green power drives with fewer issues). After a bit of messing around in UEFI to do this I ended up booting the Win7 install disk and using the command prompt there.

HBA's showing IT mode

Now for the software side of things. I'd been working on my own OpenSolaris/OpenIndiana derived NAS distribution for a while, but was also interested in trying OmniOS and SmartOS on the hardware. What I found was SmartOS wasn't the best idea for what I wanted. OmniOS was a good start but I'd want to build my idea of a NAS system on top of it (already on my todo list in fact). So for the time being I'd still run my own system.

Then I tried running this within a VM under ESXi. I'd read of a few people doing this with great success, however I was skeptical at first - partly due to concerns over how failure situations would be handled. Would a failing disk cause system instability or worse? However after playing with it and swapping disks hot, I'm much more confident that passing through the whole PCIe card solves any concerns there. The important thing is you need a mainboard and CPU that support the VT-d feature.

Direct IO

The other concern with running a virtualised file server was around performance and latency. I'm glad to report there is such a minimal difference on this hardware that it's not worth thinking about. I did some benchmarks with NetApp's Simulate IO (SIO) tool which shows nearly identical performance before/after. These were done with a 10 disk raidz2 zpool made up of 500GB WD blue drives spread across 3 HBA's.

The exact command line used was sio_ntap_win32.exe 0 0 64K 1900m 90 50 V:\testfile. The parameters are Read Percentage, Random Percentage, Block size, File Size, Duration (seconds), Thread count, Filename. Using 0 for Read% means only write, and 0 for Random% means sequential. The point of this tool is to simulate IO, so we deliberately use a high number of threads (50) to cause high IO, and a large size to reduce the benefit of memory caching on client and server. For comparison I've included a local SSD run of the same thing. The most relevant figures in here for comparison are IOPS and KB/s; each of these is the best of two runs.

Baremetal install

Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        V:\testfile
IOPS:           564
KB/s:           36080
IOs:            50738

Vmware virtual machine install

Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        V:\testfile
IOPS:           521
KB/s:           33358
IOs:            46910

Single SATA2 Sandforce based SSD (Win7 NTFS)

Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        C:\Users\robert\Desktop\testfile
IOPS:           2445
KB/s:           156497
IOs:            220074

Direct disk copies from my desktop PC's SSD's were limited only by the network. 112MB/s write speed sustained for 30GB (using 4GB test files) over CIFS. Baremetal and virtualised had the same speed - no difference at all.
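To put a number on "nearly identical" for the SIO runs, here is the arithmetic on the figures above as a short Python snippet. It uses only the IOPS and KB/s values already listed, and the helper name is just for illustration.

```python
baremetal = {"iops": 564, "kb_s": 36080}
virtual = {"iops": 521, "kb_s": 33358}


def percent_lower(before, after):
    """How much lower 'after' is than 'before', as a percentage."""
    return (before - after) / before * 100


iops_drop = percent_lower(baremetal["iops"], virtual["iops"])
kb_drop = percent_lower(baremetal["kb_s"], virtual["kb_s"])
print(f"IOPS: {iops_drop:.1f}% lower, KB/s: {kb_drop:.1f}% lower under ESXi")

# Sanity check: KB/s divided by IOPS should land near the 64K block size
print(round(baremetal["kb_s"] / baremetal["iops"]), round(virtual["kb_s"] / virtual["iops"]))
```

That is a single digit percentage on a synthetic 50 thread write test, and it vanished entirely on the real world copies, which were network bound either way.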

Next I had to decide how the cache disks would work. With ESXi on baremetal booting off USB, I had an SSD as a datastore to contain the filer's boot disk. To provide the cache disks I had a few options:

  • put them on the SAS cards (losing hot swap bays)
  • attempt a whole disk passthrough in VMware (RDM?)
  • put a datastore on it and assign a large vmdk to the guest
  • or probably a few other options

It didn't look like I could pass through a whole disk off the onboard controllers, so that was out which left me with a datastore layer of overhead or losing hot swap bays.

Then I had to decide on how to expand my zpool. Going into this upgrade I was using some very old hardware (5-7 years old) with drives that were about 2 years old (oldest 5 were 6/feb/2010). There were 10x 2TB WD Green drives in this raidz2 zpool, one with known bad sectors. Earlier in the year I had purchased an additional two drives due to disk failures that turned out to be a failing mainboard. So I had 12 disks to work with, 1 a little bit dodgy (64kB bad out of 2TB).

Working from the ZFS optimal raid size plan, I decided the next optimal size up from where I was would be one of the following:

  • 2x vdev's made up of 10 disks raidz2. 20 drives total, 4 parity disks (yes I know it's striped parity).
  • 1x vdev made up of 19 disks raidz3. 19 drives total, 3 parity disks.

So if I allow my dodgy disk to be used as a hot spare, the second option gives me a wider stripe of 19 disks, ultimately better protection against multiple disk failures and still 20 disks in the chassis. Finally I decided to put the cache disks onto the hot swap trays as I had 4 bays free. Why not put 3 SSDs in there then. So that's the plan. Right now there's 2 in there with a 3rd going in once it's been reclaimed from its current machine. The SSDs have been partitioned (GPT) to have a 2GB slice at the front for ZIL, and the rest of the disk for L2ARC. ZIL mirrored, L2ARC not.
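To make the comparison of the two layouts concrete, here is a rough sketch of the arithmetic. The class and field names are made up for illustration, and it ignores ZFS metadata overhead and the TB vs TiB difference, so treat the capacities as ballpark figures only.

```python
from dataclasses import dataclass

DISK_TB = 2  # each drive is a 2TB WD Green


@dataclass
class Layout:
    """Hypothetical helper describing a pool built from identical raidz vdevs."""
    name: str
    vdevs: int
    disks_per_vdev: int
    parity_per_vdev: int  # 2 for raidz2, 3 for raidz3

    @property
    def total_disks(self) -> int:
        return self.vdevs * self.disks_per_vdev

    @property
    def data_disks(self) -> int:
        return self.vdevs * (self.disks_per_vdev - self.parity_per_vdev)

    @property
    def parity_disks(self) -> int:
        return self.vdevs * self.parity_per_vdev


layouts = [
    Layout("2x 10-disk raidz2", vdevs=2, disks_per_vdev=10, parity_per_vdev=2),
    Layout("1x 19-disk raidz3", vdevs=1, disks_per_vdev=19, parity_per_vdev=3),
]

for layout in layouts:
    print(
        f"{layout.name}: {layout.total_disks} disks, "
        f"{layout.data_disks} data + {layout.parity_disks} parity, "
        f"~{layout.data_disks * DISK_TB}TB before overhead, "
        f"any one vdev survives {layout.parity_per_vdev} failed disks"
    )
```

Both options end up with 16 data disks, so the choice comes down to one fewer drive and one more failure tolerated per vdev for the raidz3 layout, which leaves the 20th bay for the hot spare.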

Slot drive type

Initial Data Seeding.

To copy my data on I used zfs send/receive via a utility called mbuffer. Mbuffer helps smooth out any drops or bursts in IO on the sending side to help maintain a higher average speed of transfer over the network. In the past I have had some issues with this when sending a whole dataset. This time around however I had no such issues and was able to copy the entire dataset in one continuous operation.

summary: 12.5 TByte in 39 h 49 min 91.4 MB/s
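As a quick sanity check on that summary line, the average rate can be recomputed from the size and duration. This sketch assumes the figures are in binary units (1 TByte = 1024 x 1024 MB), which is an assumption on my part but is the reading that reproduces the reported number.

```python
def average_rate_mib_s(tib: float, hours: int, minutes: int) -> float:
    """Average transfer rate in MiB/s for `tib` TiB moved in hours:minutes."""
    seconds = hours * 3600 + minutes * 60
    return tib * 1024 * 1024 / seconds


rate = average_rate_mib_s(12.5, 39, 49)
print(f"{rate:.1f} MiB/s")  # prints "91.4 MiB/s"
```

With decimal units the same sum comes out closer to 87 MB/s, so the binary interpretation looks like the right one here.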

Now for some final benchmarking (this is with 100GB L2ARC, 2GB ZIL and the 19 disk raidz3 of 2TB WD Green drives)

CIFS: sio_ntap_win32.exe 0 0 64K 1900m 90 50 V:\testfile

Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        V:\testfile
IOPS:           1112
KB/s:           71191
IOs:            100112
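For anyone reading the SIO command lines in this post, the positional arguments appear to map one-to-one onto the fields in the output. The sketch below is inferred purely from matching the commands against the results shown here (it is not taken from the SIO documentation), and the function and key names are made up for illustration.

```python
# Binary multipliers, inferred from the output (64K is reported as 65536).
SIZE_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as '64K', '1900m' or '140g' into bytes."""
    suffix = text[-1].lower()
    if suffix in SIZE_UNITS:
        return int(text[:-1]) * SIZE_UNITS[suffix]
    return int(text)


def parse_sio_args(argv: list[str]) -> dict:
    """Map SIO's positional arguments onto the field names it prints."""
    read_pct, random_pct, block, size, secs, threads, path = argv
    return {
        "read_pct": int(read_pct),
        "random_pct": int(random_pct),
        "block_size": parse_size(block),
        "file_size": parse_size(size),
        "secs": int(secs),
        "threads": int(threads),
        "file": path,
    }


run = parse_sio_args("0 0 64K 1900m 90 50 V:\\testfile".split())
print(run)
```

So every test in this post is 100% sequential writes (0% read, 0% random) in 64K blocks from 50 threads, with only the file size and duration changing between the small and big runs.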

NFS4: ./sio_ntap_linux 0 0 64K 1900m 90 50 /storage/siotest/testfile

Read %:     0
Random %:   0
Block Size: 65536
File Size:  1992294400
Secs:       90
Threads:    50
File(s):    /storage/siotest/testfile 
IOPS:       1342
KB/s:       85915
IOs:        120818

Big test (working file size > memory + L2ARC size):

CIFS: sio_ntap_win32.exe 0 0 64K 140g 300 50 V:\testfile2

Read %:         0
Random %:       0
Block Size:     65536
File Size:      150323855360
Secs:           300
Threads:        50
File(s):        V:\testfile2
IOPS:           727
KB/s:           46506
IOs:            217997

NFS4: ./sio_ntap_linux 0 0 64K 140g 300 50 /storage/siotest/testfile2

Read %:     0
Random %:   0
Block Size: 65536
File Size:  150323855360
Secs:       300
Threads:    50
File(s):    /storage/siotest/testfile2 
IOPS:       1503
KB/s:       96197
IOs:        450923

And now for a real world comparison I ran the same tests on an idle (not in production) NetApp FAS2240-2. However as the test machine was not the same I had to perform benchmarks of my system again from this client. It turned out the test machine is a pile of crap when it comes to network load testing.

CIFS to NetApp FAS2240-2 (19 disk RAID-DP aggregate of 600GB 10k SAS disks - tested from a dual core, 6GB RAM laptop via a crossover cable) sio_ntap_win32.exe 0 0 64K 1900m 90 50 Z:\testfile

Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        Z:\testfile
IOPS:           490
KB/s:           31349
IOs:            44082

CIFS to my system, same laptop. sio_ntap_win32.exe 0 0 64K 1900m 90 50 Z:\testfile

Read %:         0
Random %:       0
Block Size:     65536
File Size:      1992294400
Secs:           90
Threads:        50
File(s):        Z:\testfile
IOPS:           414
KB/s:           26520
IOs:            37299

CIFS to the NetApp again, 49GB test size (didn't have time for mkfile to produce a larger file) (volume not deduped, no flash cache, no flash pool, controller has 6GB RAM, 768MB NVMEM). sio_ntap_win32.exe 0 0 64K 49g 300 50 Z:\testfile2

Read %:         0
Random %:       0
Block Size:     65536
File Size:      52613349376
Secs:           300
Threads:        50
File(s):        Z:\testfile2
IOPS:           775
KB/s:           49609
IOs:            232529

CIFS to my system, same laptop, large file test (140GB - unfortunately I didn't test with a 49GB file for an equal comparison, however that could have fit in L2ARC so wouldn't have been fair anyway) sio_ntap_win32.exe 0 0 64K 140g 300 50 Z:\testfile2

Read %:         0
Random %:       0
Block Size:     65536
File Size:      150323855360
Secs:           300
Threads:        50
File(s):        Z:\testfile2
IOPS:           322
KB/s:           20616
IOs:            96639

CIFS to my system, from original desktop pc (showing just how crap that laptop is - tests run on same day as the laptop tests for a control sample) sio_ntap_win32.exe 0 0 64K 140g 300 50 Z:\testfile2

Read %:         0
Random %:       0
Block Size:     65536
File Size:      150323855360
Secs:           300
Threads:        50
File(s):        Z:\testfile2
IOPS:           854
KB/s:           54676
IOs:            256301

Interpreting the SIO results can be a bit of dark voodoo. Unfortunately I wasn't able to test the NetApp with a more realistic system - the laptop is clearly crap, achieving only about 38% of the IOPS my desktop could achieve over the same network. Ignoring the crap laptop for now, this shows that the NetApp is clearly superior (as should be expected), however by a much smaller margin than I had expected. On the small test (which would fit in the RAM of both systems) the NetApp achieves 18% more IOPS (and throughput). For the large test the gap widens dramatically (however test sizes were different). I'd be willing to bet the NetApp had much more headroom available for load than my system did - of course this wouldn't be visible with such a crap test machine. Due to this I think these tests are flawed and totally useless, apart from proving that my work laptop fails at networking.
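The percentages quoted above can be re-derived from the IOPS figures in the result blocks. This is just the arithmetic written out; the dictionary keys are labels I've made up for the individual runs.

```python
# IOPS copied from the SIO result blocks in this post.
iops = {
    "netapp_laptop_small": 490,   # 1900m file, NetApp, from the laptop
    "mine_laptop_small": 414,     # 1900m file, my system, from the laptop
    "netapp_laptop_49g": 775,     # 49g file, NetApp, from the laptop
    "mine_laptop_140g": 322,      # 140g file, my system, from the laptop
    "mine_desktop_140g": 854,     # 140g file, my system, from the desktop
}

# How much of the desktop's IOPS the laptop managed against the same filer.
laptop_vs_desktop = iops["mine_laptop_140g"] / iops["mine_desktop_140g"]

# NetApp's lead over my system on the small test, same client.
netapp_lead_small = iops["netapp_laptop_small"] / iops["mine_laptop_small"] - 1

# Large test ratio. The file sizes differ (49g vs 140g), so not like for like.
netapp_ratio_large = iops["netapp_laptop_49g"] / iops["mine_laptop_140g"]

print(f"laptop vs desktop:       {laptop_vs_desktop:.1%}")
print(f"NetApp lead, small test: {netapp_lead_small:.1%}")
print(f"NetApp ratio, large test: {netapp_ratio_large:.2f}x")
```

The first two come out at a little under 38% and a little over 18%, and the large test ratio is roughly 2.4x, with the caveat about mismatched file sizes already mentioned.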

One thing I did notice while running these tests which I'd never seen before, was large use of the ZIL. Previously when I had a mirrored ZIL on SSDs I'd allocated 8GB for it, however I'd never seen it above about 200MB ever. I based 8GB on the old "how much data could you ingest in 30 seconds, and double it". Allowing for a maxed out 1Gbit interface, 8GB seemed a good number. However I never saw it anywhere near used. So this time around I worked out a 2GB number from a more conservative "how much data could you ingest in 8 seconds, doubled", and working from 125MB/s. 8 seconds because the default flush interval is 5 seconds. In practice when writing flat out the disks are all only hit for a burst every 5 seconds. Part of this ZIL sizing comes from my NetApp experience where the NVRAM/NVMEM performs ultimately the same function (but is battery backed for power loss/crash consistency). Only the biggest NetApp system has 8GB NVRAM and it can easily write from filled 10GbE interfaces out to over 1000 disks. Consider the FAS2240 I've been comparing it to, that has 768MB (which also is halved if used in an HA pair because it's a mirror of the partner's NVMEM too). This suggests I might be around the right ballpark even though the comparison is not totally apples:apples.
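The two sizing rules of thumb described above boil down to the same formula with a different time window. A minimal sketch, with a made-up function name and the 125MB/s figure for a saturated 1Gbit link taken from the text:

```python
def zil_size_mb(ingest_mb_per_s: float, window_s: float, factor: float = 2) -> float:
    """Rule-of-thumb ZIL size: data ingested during the window, times a safety factor."""
    return ingest_mb_per_s * window_s * factor


GIGABIT_MB_PER_S = 125  # maxed out 1Gbit interface

old_rule = zil_size_mb(GIGABIT_MB_PER_S, 30)  # "30 seconds, and double it"
new_rule = zil_size_mb(GIGABIT_MB_PER_S, 8)   # "8 seconds, doubled"

print(f"30 second rule: {old_rule:.0f}MB")  # 7500MB, rounded up to 8GB
print(f"8 second rule:  {new_rule:.0f}MB")  # 2000MB, i.e. the 2GB slice
```

Plugging in a different link speed or flush interval gives a starting point for other systems, but it is only a rule of thumb and worth checking against what zpool iostat reports under load.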

During the large file NFS SIO tests above I ran a quick zpool iostat -v 5 300 and spotted ZIL usage above 1.5GB! Fortunately it didn't stay there and hovered around 1GB for most of the test. Perhaps 2GB is close to correct, if not slightly too small for this system? Following is the zpool iostat while running NFS SIO tests:

                 capacity     operations    bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
marlow        16.0T  18.5T     96  1.68K  12.1M   146M
  raidz3      16.0T  18.5T     96    605  12.1M  70.2M
    c5t2d0        -      -     79     47   645K  4.64M
    c5t3d0        -      -     79     75   665K  4.63M
    c5t4d0        -      -     79     48   655K  4.65M
    c5t5d0        -      -     79     69   653K  4.64M
    c5t6d0        -      -     78     90   646K  4.63M
    c5t7d0        -      -     79     64   649K  4.64M
    c6t1d0        -      -     79     47   653K  4.65M
    c6t2d0        -      -     79     45   658K  4.65M
    c6t3d0        -      -     79     45   660K  4.65M
    c6t4d0        -      -     80     45   658K  4.65M
    c6t5d0        -      -     79     45   658K  4.65M
    c6t6d0        -      -     79     45   663K  4.65M
    c6t7d0        -      -     79     45   651K  4.65M
    c7t1d0        -      -     78     51   643K  4.64M
    c7t2d0        -      -     78     47   646K  4.65M
    c7t3d0        -      -     78     71   644K  4.64M
    c7t4d0        -      -     79     72   653K  4.63M
    c7t5d0        -      -     78     45   650K  4.65M
    c7t6d0        -      -     79     47   656K  4.65M
logs              -      -      -      -      -      -
  mirror      1.51G   485M      0  1.09K      0  76.1M
    c6t0d0s0      -      -      0  1.09K      0  76.1M
    c7t0d0s0      -      -      0  1.09K      0  76.1M
cache             -      -      -      -      -      -
  c6t0d0s1    53.9G      0     32    142  4.03M  17.7M
  c7t0d0s1    53.9G      0     33    135  4.14M  16.8M
------------  -----  -----  -----  -----  -----  -----
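To put a number on how full the log mirror was in that sample, the alloc and free columns can be combined. This sketch assumes zpool iostat's K/M/G/T suffixes are binary multiples, and the helper name is made up.

```python
# Suffix multipliers expressed in MiB (assumed binary units).
TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def to_mib(text: str) -> float:
    """Convert a zpool iostat style size such as '1.51G' or '485M' to MiB."""
    return float(text[:-1]) * TO_MIB[text[-1]]


# alloc and free columns from the log mirror row in the output above.
log_alloc = to_mib("1.51G")
log_free = to_mib("485M")

used_fraction = log_alloc / (log_alloc + log_free)
print(f"log mirror: {log_alloc:.0f}MiB used of {log_alloc + log_free:.0f}MiB "
      f"({used_fraction:.0%})")
```

That works out to roughly three quarters of the 2GB slice in use at the peak, which fits the feeling that 2GB is about right but without a lot of spare room.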

And for the overall happy snaps. This is a nearly finished internal shot of the rear part of the case, showing the mainboard, two cooling fans (which now have 50ohm resistors in series to slow them down) and the nearly finished cable routing. The PCI cards are HBA1, HBA2, HBA3 from top to bottom. On the Intel SASU8CI card, the connector closest to the mainboard is SAS ports 0-3, and the other connector is 4-7. Since this photo was taken I've also added 2 more 8GB memory modules (taking it to 32GB total), the Intel Remote Management Module (RMM) (which gives me remote console, remote cdrom/usb capability) and an Intel Quad port ethernet card.

Filer Overview

Front view showing the UPS too. Despite being brand new I do have a failed drive presence LED on bay 15 (4th down on left) which I need to find out about replacing.

Filer Front

Yikes, this has turned into quite a big write up. In a future post I'll go into more detail of the software side, particularly my custom OpenIndiana thing.


Copyright © 2001-2016 Robert Harrison. Powered by hampsters on a wheel.