Like most, I learn a lot more by doing things wrong before doing them right. Maybe I can save someone some of my learning pain, I mean curve!

Monday, September 21, 2009

zfs de-duplication has broken my heart, or ...

more accurately, my lack of attention to detail has broken my heart.  As I stated in my post on the FreeNAS forum "zfs de-duplication - is it working?" - "I absolutely hate it when reality doesn't match my pre-conceived ideas!!"

I'll blame it on the zfs zealots; surely it wasn't me. Surely these firebrand-wielding zfs proselytes are to blame for connecting my wants, I mean needs, up to future features and not letting me see the error of my ways: my desire for block level de-duplication was/is nothing but vaporware, at least for now :(

While headway is being made on de-duplication for zfs, it is somewhere between alpha and beta land on Solaris (or OpenSolaris, or !@#$Solaris land, ...). Block level de-duplication is nowhere near being part of zfs version 6, the version currently implemented in FreeBSD 7. It isn't even part of version 13, which will be available in the next major FreeBSD release.

I was torn over what to do. I considered going to Microsoft Windows Home Server and getting at least file level de-duplication, or possibly cracking the piggy bank and going to Windows Server 2008. But in the end, after doing some scribbling on the back of a napkin and figuring out that I would be money ahead to buy another drive, re-build my array, and be OK (defined as enough space without block level de-duplication) for another 12 - 18 months, I decided to stay with FreeNAS, for now, until something better comes along, until I get bored... err, I'm rambling again.

Stay tuned until next time, when I discuss the trials and travails of getting some CF to IDE adapters to work!

And since it is NCAA Football season here in the States -- GEAUX TIGERS!!!

Bye for now,

lbe

Wednesday, September 2, 2009

Let's Tune 'er Up!!

I've had a little time since my last post to work on the tuning. Please consider this somewhere between coarse and medium tuning, and certainly not fine tuning!

As I stated in a previous post, using the 0.71RC1 straight install and enabling nothing more than "large read/write" and "use sendfile" in CIFS, I was able to achieve transfer rates of approximately 17 MB/sec. Performance with ftp was actually lower, ~9-10 MB/sec. In order to tune a system, one must know what its components can do. The components in this case were the raw drives, the zfs partition and samba/cifs. The tests of these components and their results are shown below.

Disk Test
I ran diskinfo from an ssh console against the first drive in my array (all three drives are the same - 1.5 TB WD Caviar Green WD15EADS).

nas01:/# diskinfo -tv ad4
ad4
512 # sectorsize
1500301910016 # mediasize in bytes (1.4T)
2930277168 # mediasize in sectors
2907021 # Cylinders according to firmware.

16 # Heads according to firmware.
63 # Sectors according to firmware.
ad:WD-WCAVY0511783 # Disk ident.

Seek times:
Full stroke: 250 iter in 7.376821 sec = 29.507 msec
Half stroke: 250 iter in 5.211402 sec = 20.846 msec
Quarter stroke: 500 iter in 8.364027 sec = 16.728 msec
Short forward: 400 iter in 3.197826 sec = 7.995 msec
Short backward: 400 iter in 3.506082 sec = 8.765 msec
Seq outer: 2048 iter in 0.774789 sec = 0.378 msec
Seq inner: 2048 iter in 0.571217 sec = 0.279 msec

Transfer rates:
outside: 102400 kbytes in 1.051929 sec = 97345 kbytes/sec
middle: 102400 kbytes in 1.142268 sec = 89646 kbytes/sec
inside: 102400 kbytes in 1.950994 sec = 52486 kbytes/sec
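
The other two drives can be checked the same way from the same ssh console; the ad6 and ad8 device names below are from my box and are an assumption for yours, so adjust them to whatever atacontrol list (or dmesg) shows:

nas01:/# for d in ad6 ad8; do diskinfo -tv $d; done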

Clearly, the disk performance is not the cause of the bottlenecks, since even the slowest disk transfer rate was at least three times faster than my sustained transfer rate.

zfs Partition
I used dd to create a 1GB file using two different block sizes, 1 & 64 KB.


nas01:/mnt/vdev0mgmt# dd if=/dev/zero of=mytestfile.out bs=1K count=1048576
1048576+0 records in
1048576+0 records out
1073741824 bytes transferred in 47.028245 secs (22831850 bytes/sec)

nas01:/mnt/vdev0mgmt# dd if=/dev/zero of=mytestfile.out bs=64K count=16384
16384+0 records in
16384+0 records out
1073741824 bytes transferred in 11.612336 secs (92465619 bytes/sec)
Again, both of these cases exceeded my test case, though only barely with the 1 KB block size (roughly 22.8 MB/sec). So the zfs partition is not the culprit.

Network Transfer
I used iperf to test transfer rates in both directions, from server to workstation and vice versa. The workstation used is a quad-core Intel with 8 GB of RAM. Memory use during the tests never exceeded 6 GB, eliminating any workstation disk interaction. I ran iperf with two transaction sizes, 8 and 64 KB. The results are:

C:\>iperf -l 8K -t 30 -i 2 -c 192.168.2.5
------------------------------------------------------------
Client connecting to 192.168.2.5, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[148] local 192.168.2.207 port 57889 connected with 192.168.2.5 port 5001
[ ID] Interval Transfer Bandwidth
[148] 0.0-30.0 sec 984 MBytes 275 Mbits/sec

C:\>iperf -l 64K -t 30 -i 2 -c 192.168.2.5
------------------------------------------------------------
Client connecting to 192.168.2.5, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[148] local 192.168.2.207 port 57890 connected with 192.168.2.5 port 5001
[ ID] Interval Transfer Bandwidth
[148] 0.0-30.3 sec 1.70 GBytes 482 Mbits/sec

Again, both of these cases exceeded my test case significantly but did show that transaction size has a lot to do with network efficiency.
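
In case you want to repeat these tests, the receiving end has to be running iperf in server mode. I did not capture that side of the console, but it is nothing more than the following on the NAS (and the equivalent from a command prompt on the workstation for the reverse direction):

nas01:/# iperf -s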

samba/cifs Test
I used the Microsoft RoboCopy utility to copy a 1 GB file from and to the NAS server from my Windows workstation. The results are:

W:\tmp>robocopy . c: temp.file (READ)
Speed : 19140465 Bytes/sec.
Speed : 1095.226 MegaBytes/min.

Ended : Wed Sep 02 17:57:30 2009

This test more or less approximated my original test though it was slightly faster.

Conclusions:
  1. Neither the hard drives, the file system nor the network were major contributors to the bottleneck.
  2. The bottlenecks, then, are "probably" in the kernel (stack, IPC, filesystem and network drivers) and in samba/cifs.

Being basically lazy, and definitely not a good scientist (sorry teachers :( ), I surfed the web and found some tried and true tunings for samba/cifs as well as some items that seem to make sense for the kernel. Note that testing has shown this configuration works for my server. I think the samba/cifs settings will likely help on any server, as they have for me over the years across multiple Linux and BSD distributions. The kernel tunings are likely to have a heavy dependency on the hardware that I use, namely the 1.6 GHz dual-core Atom 330, the Intel chipset (945GC northbridge and ICH7 southbridge) and the Realtek NIC (RTL8111C) built into the MS-9832 motherboard, and its 2 GB of RAM. If your hardware is too dissimilar, you will "definitely" need to validate values on your own.

Here's what I did to tune my server.

samba/cifs tweaks
I added the following two lines to the auxiliary parameters on the services/cifs configuration page.

max xmit = 65535
socket options = TCP_NODELAY IPTOS_LOWDELAY SO_SNDBUF=65535 SO_RCVBUF=65535

I also set the send and receive buffers to 65535 to ensure that is what they are.
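
If you want to confirm that samba actually picked these up, testparm from an ssh console will dump the effective configuration; the grep is just a convenience and assumes testparm is included in your FreeNAS build:

nas01:/# testparm -sv 2>/dev/null | egrep -i "max xmit|socket options"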

kernel tweaks
I harvested my kernel tunings from multiple locations, with references to their sources embedded as remarks below. These additions were made to my /cf/boot/loader.conf file since I am booting from a USB flash drive. I used the advanced file editor in the WebGUI to make these changes since it takes care of mounting the flash drive read/write and then resetting it to read-only.
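
If you would rather make the edits from an ssh console instead of the WebGUI, a remount along these lines should work (a sketch only; it assumes the config partition is mounted read-only at /cf, as it is on my embedded install):

nas01:/# mount -uw /cf
nas01:/# vi /cf/boot/loader.conf
nas01:/# mount -ur /cf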

# http://acs.lbl.gov/TCP-tuning/FreeBSD.html
kern.ipc.shmmax=67108864
kern.ipc.shmall=32768
# http://harryd71.blogspot.com/2008/10/tuning-freenas-zfs.html
vm.kmem_size_max="1024M"
vm.kmem_size="1024M"
vfs.zfs.prefetch_disable=1
# http://wiki.freebsd.org/ZFSTuningGuide
vfs.zfs.arc_max="100M"
# ups spinup time for drive recognition
hw.ata.to=15
# System tuning - Original -> 2097152
kern.ipc.maxsockbuf=16777216
# System tuning
kern.ipc.nmbclusters=32768
# System tuning
kern.ipc.somaxconn=8192
# System tuning
kern.maxfiles=65536
# System tuning
kern.maxfilesperproc=32768
# System tuning
net.inet.tcp.delayed_ack=0
# System tuning
net.inet.tcp.inflight.enable=0
# System tuning
net.inet.tcp.path_mtu_discovery=0
# http://acs.lbl.gov/TCP-tuning/FreeBSD.html
net.inet.tcp.recvbuf_auto=1
# http://acs.lbl.gov/TCP-tuning/FreeBSD.html
net.inet.tcp.recvbuf_inc=16384
# http://acs.lbl.gov/TCP-tuning/FreeBSD.html
net.inet.tcp.recvbuf_max=16777216
# System tuning
net.inet.tcp.recvspace=65536
# http://acs.lbl.gov/TCP-tuning/FreeBSD.html
net.inet.tcp.rfc1323=1
# http://acs.lbl.gov/TCP-tuning/FreeBSD.html
net.inet.tcp.sendbuf_auto=1
# http://acs.lbl.gov/TCP-tuning/FreeBSD.html
net.inet.tcp.sendbuf_inc=8192
# System tuning
net.inet.tcp.sendspace=65536
# System tuning
net.inet.udp.maxdgram=57344
# System tuning
net.inet.udp.recvspace=65536
# System tuning
net.local.stream.recvspace=65536
# System tuning
net.local.stream.sendspace=65536
# http://acs.lbl.gov/TCP-tuning/FreeBSD.html
net.inet.tcp.sendbuf_max=16777216
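
After the reboot, it is worth spot-checking that the values actually took. A quick sysctl from an ssh console against a few of the variables above will do, for example:

nas01:/# sysctl vfs.zfs.prefetch_disable kern.ipc.maxsockbuf kern.ipc.somaxconn net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max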

While I originally had the intent of testing the impact of each individual setting, I quickly grew bored with the reboots and rigor required (hence my reason for choosing an engineering vs. a scientific career :)). I can clearly say that you "must" disable the zfs prefetch (vfs.zfs.prefetch_disable=1) in order to get read rates up to the levels that I have achieved.

Tuned Results!
My first tests were to verify that I achieved my end goal of having read and write rates in excess of 35 MB/sec. My goal was achieved. Yeah!!!


C:\>robocopy w: . temp.file (READ)
Speed : 39179078 Bytes/sec.
Speed : 2241.844 MegaBytes/min.


C:\>robocopy . w: temp.file2 (WRITE)
Speed : 35574390 Bytes/sec.
Speed : 2035.582 MegaBytes/min.

While the previous test verified my goal, I also wanted to see if the changes to the kernel network configuration changed the base network performance. They did, significantly.

C:\>iperf -l 8k -t 30 -i 2 -c 192.168.2.5
------------------------------------------------------------
Client connecting to 192.168.2.5, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[148] local 192.168.2.207 port 58332 connected with 192.168.2.5 port 5001
[ ID] Interval Transfer Bandwidth
[148] 0.0-30.0 sec 985 MBytes 276 Mbits/sec

C:\>iperf -l 64k -t 30 -i 2 -c 192.168.2.5
------------------------------------------------------------
Client connecting to 192.168.2.5, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[148] local 192.168.2.207 port 58334 connected with 192.168.2.5 port 5001
[ ID] Interval Transfer Bandwidth
[148] 0.0-30.3 sec 1.89 GBytes 537 Mbits/sec

nas01:/# iperf -l 8k -t 30 -i 2 -c 192.168.2.207
------------------------------------------------------------
Client connecting to 192.168.2.207, TCP port 5001
TCP window size: 65.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.2.5 port 55338 connected with 192.168.2.207 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.1 sec 2.11 GBytes 601 Mbits/sec

nas01:/# iperf -l 64k -t 30 -i 2 -c 192.168.2.207
------------------------------------------------------------
Client connecting to 192.168.2.207, TCP port 5001
TCP window size: 65.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.2.5 port 49701 connected with 192.168.2.207 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.1 sec 2.12 GBytes 604 Mbits/sec

From the workstation to the server, the tuning left the small transaction rate essentially unchanged (275 vs. 276 Mbps) but raised the large transaction rate from 482 to 537 Mbps. I did not capture the untuned screens for the server-to-workstation direction, but those results were approximately 550 Mbps and are now 601 Mbps for the small transaction size and 604 Mbps for the large one.

As I sleepily look back at what I have done, I realize that I have not tested the NAS server to the point of determining the effect of filling the cache, since I have tested with a file size of 1 GB and have 2 GB of RAM in the server. That will have to wait for another day.
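
When that day comes, the test should be as simple as creating a file several times larger than RAM so the reads cannot be served from the cache, then repeating the copies. Something like the dd below should do it (the 8 GB size is just my guess at a comfortable margin, and /dev/zero is only a fair test if compression is off on the dataset):

nas01:/mnt/vdev0mgmt# dd if=/dev/zero of=bigtestfile.out bs=64K count=131072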

Again, as I warned earlier, you "must" test to make sure that these tweaks are suitable for your hardware. If you are running the 32-bit version of FreeNAS, there are many more kernel tweaks needed. If you are running with less memory, you will need to reduce some of the allocations. If you have a much larger server with many more clients, you will need to increase the allocations and will probably want a better NIC than my bargain-basement Realtek.

Realtek takes a beating in many of the NAS and network-centric forums; however, depending upon the usage patterns, the Realtek can be a more than capable GigE NIC, as this testing shows. It just bears getting the tuning right!!!

Bye for now, more errors are coming this way!!!
