Like most, I learn a lot more by doing things wrong before doing them right. Maybe I can save someone some of my learning pain, I mean curve!

Wednesday, September 8, 2010

FreeBSD ain’t free, if I value my time and include the cost of confusion!

There, I said it.  Flame on, all of you FreeBSD ideologues!!!  If you look through the history of this blog, you will see that I have been addressing my NAS needs for a while: well over a year with FreeNAS running on FreeBSD 7.2, and now rolling my own with FreeBSD 8.1.  If you look at the timestamps of the entries, you will notice some significant gaps.

There are several sayings that could be applied to these gaps.  “No news is good news” and “ignorance is bliss” are examples of the positive and the polite; however, to do it justice requires Latin: “non impediti ratione cogitationis.”  In English, this translates to “unimpeded by the thought process.”  Like most things that I know, I can’t take credit for this phrase.  I learned it from a couple of my idols, Click and Clack, the Tappet Brothers, of Car Talk.

I now know that I wasn’t even thinking about some things that could have and should have bothered me.  Unfortunately, now that I know about them and think I am on a path to addressing them, I am worrying about what else I do not know!  Read on ...

What have I learned?

1. If you want to run zfs without worries, run it on some big honking hardware!
2. Even with big honking hardware, you should still be worried. You will still get yours, just later instead of sooner :)
3. Reliably running zfs on FreeBSD on commodity, low power hardware requires the sacrifice of millions of brain cells on a weekly if not daily basis!
4. Always have backups!!
5. Always have backups of your backups!!!

Gripes first -

I won’t lie and say that I always RTFM, but I RTFM much more than most.  So when I started out on this venture to build the perfect home NAS server, I spent a lot of time in my recliner with my laptop on my bulging belly (both of which can be attested to by my wife), considering not only what choices I should make but also how I should make them.  I really thought I had found an appliance approach in FreeNAS with which I would be happy.  Me being happy with IT appliances in general is a non sequitur, because I always want to add some things or change what isn’t done just the way I want.  But FreeNAS got me a long way to where I wanted to go.

Initially FreeNAS was reliable and seemed to be performant, but I had to go and add sabnzbd to download my wife’s (yes, it must be her fault :) favorite TV shows from supernews, and add Sick-Beard to handle setting up the downloads.  What’s the harm with a couple of little Python programs?  It wasn’t that difficult to hack Python 2.6, sabnzbd and Sick-Beard into my embedded version of FreeNAS.  For brevity’s sake, I’ll skip the details.  Prior to doing this, FreeNAS was running pretty well, though I had figured out that FreeBSD 7.2 and my Atom 330 based MB didn’t get along well with hyper-threading enabled.  But following this, I started to notice performance problems when synchronizing my digital photography library from my desktop to the server using a robocopy script that overwrote everything every time (my error, not robocopy’s).  But I said, what the hey, and ignored it.

Then the server started crashing from time to time and I couldn’t figure out why.  I enabled a syslog server, pored over the logs, couldn’t find anything, couldn’t make it crash, ...  I was going crazy trying to figure out what was up.  I reached a point where I was ready to give up on FreeBSD, go back to my tried-and-true, reliable Linux, and ignore file system goodness until btrfs is ready for use.  But before totally jumping and having to migrate data, I decided to stick a different CF card into my CF/IDE adapter and install FreeBSD 8.1 RC2.  I decided I could forgo the FreeNAS GUI if I could get stability and consistent performance.  Voila, it appeared that I had it, until ... I tried to copy about 500 GB of data to an external drive that I planned to keep in a drawer at my office as an offsite backup of the critical files.  When doing this, I started to see big-time performance problems and was able to crash the server somewhat regularly, with a cryptic but key error message about running out of kmem.

In researching this, I learned the real meaning of some of the tuning values that I had in my /boot/loader.conf: vm.kmem_size_min and vm.kmem_size_max as well as vfs.zfs.arc_min and vfs.zfs.arc_max.  With a little trial and error, I came up with appropriate settings for my little box and eliminated the crashes; however, the transfer performance would still drop off over time as rsync ran to replicate my critical data.  Finally, I came across references to problems with the Western Digital 1.5 TB Green drives (WD15EADS) that I am using.

The drives have a 4 KB physical sector but report 512-byte sectors to the BIOS.  So performance drops off on really big writes because zfs on FreeBSD sends 4 KB of data to the drive as 8 separate writes of 512 bytes, and the firmware in the drive has to turn each of those into work on a full 4 KB sector: the 1st 512 bytes costs a 4 KB write, and each of the 2nd through 8th costs a 4 KB read plus a 4 KB write.  So 4 KB of writes becomes 4 KB write + (4 KB read + 4 KB write) x (4 KB / 512 bytes - 1) = 60 KB of internal I/O, which is an estimated 15 times the work load.  The drive’s built-in 32 MB cache helps until it fills; then the zfs arc kicks in, and then the arc begins to fill.  So all in all, no big deal, right?
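The arithmetic above can be checked with a few lines of Python.  This is only a sketch of the simple model described in this post (first 512-byte write costs one 4 KB write, every later one costs a 4 KB read plus a 4 KB write); the function name is made up for illustration, and real firmware may well be smarter than this.

```python
def rmw_bytes_per_physical_sector(logical=512, physical=4096):
    """Bytes the drive moves internally to fill ONE physical sector
    when the host sends it as separate logical-sector writes.

    Assumed model (from the estimate in this post, not from WD docs):
      - 1st logical write: one full physical-sector write
      - each later logical write: one physical read + one physical write
    """
    writes_per_sector = physical // logical          # 8 for 512 B into 4 KB
    first = physical                                 # plain write
    rest = (writes_per_sector - 1) * (physical * 2)  # read + write, 7 times
    return first + rest


total = rmw_bytes_per_physical_sector()
amplification = total / 4096  # internal bytes moved per payload byte

print(f"{total // 1024} KB of internal I/O for 4 KB of data")
print(f"amplification: {amplification:.0f}x")
```

Under this model the drive moves 60 KB internally for every 4 KB it is asked to store, which is 15 times the bytes.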

Actually, it is a very big deal if you are writing files to zfs that are larger than your arc plus the size of the buffer on the drive.  And because of the behavior of the zfs arc cache code on FreeBSD (notice I am not calling it a bug, because I don’t know enough to point to where it is), the allocated memory is not made available for re-use at a rate fast enough to sustain the transfer speed, and the throughput drops over time.

You can observe this yourself by executing a copy while watching the free memory in top drop as the inactive memory increases.  This is further exacerbated by memory in the wired pool not being marked as inactive quickly enough.  This “appears” to me to indicate that FreeBSD and/or zfs is too aggressive in grabbing memory for caching relative to the rate at which it releases it.  This results in transfer speeds well below those of other operating systems running on the same hardware.

Fortuitously, BSD provides gnop, a drive geometry abstraction layer, to create another lie to offset the lie told to the BIOS by the drive (512-byte sectors instead of 4 KB sectors).  Unfortunately, this layer is not saved as metadata on the drive, so it will not persist through a reboot.  Fortunately, I found a script on a Japanese web site (nothing on the site except for the script and the tests was in English) that I used to create the gnop geometry entries prior to starting zfs.
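For illustration only, here is a rough Python sketch that builds the kind of gnop commands such a boot-time script would have to run.  It is not the script from that site.  The device names (ada0, ada1) are hypothetical, the helper name is made up, and the sketch only prints the commands rather than running them.  It assumes the usual `gnop create -S 4096` form, which creates a matching `.nop` device that advertises 4096-byte sectors.

```python
def gnop_commands(devices, sector_size=4096):
    """Return one 'gnop create' command line per device.

    devices: bare device names, e.g. "ada0" (hypothetical examples).
    sector_size: the sector size the .nop provider should advertise.
    """
    return [f"gnop create -S {sector_size} /dev/{dev}" for dev in devices]


def nop_providers(devices):
    """Names of the providers gnop creates; these are what zfs should use."""
    return [f"/dev/{dev}.nop" for dev in devices]


if __name__ == "__main__":
    disks = ["ada0", "ada1"]  # assumption: replace with your own disks
    for cmd in gnop_commands(disks):
        print(cmd)
    print("pool members:", " ".join(nop_providers(disks)))
```

Because the .nop devices vanish at reboot, commands like these have to run before zfs starts on every boot.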

This significantly increased my performance; however, over time, the transfer rate would still drop because of the aforementioned memory allocation issue (note I didn’t say problem or bug ;).  But I found another person who had created a one-line perl command (yes, perl to the rescue!!! who needs a stinking snake ;) that tries to allocate an exorbitant amount of memory.  This triggers the FreeBSD memory management to release the memory, and the kernel kills the overreaching little process to boot.  The result is that the memory is freed up for reuse.

So with the gnop geometry implemented and the perl one-liner running in cron, I am able to sustain a whopping 9-10 MB/sec transfer rate that pretty much renders the server unusable while big transfers occur.  While this isn’t great, it is much better than getting down to 1-2 MB/sec and crashing!

Fortunately, these big writes don’t occur too often, so most of the time my little low power box can pump out transfer rates on the order of 30-40 MB/sec as long as the files don’t exceed about 750 MB, based upon my current tunings.  I am currently in the process of implementing an L2ARC using a CF card in a CF/IDE adapter and a ZIL using a higher write speed CF card (my cheap version of SSDs).  I am also going to add a third CF card on which to place the cache and log directories for sabnzbd and Sick-Beard, as well as the SQLite3 database for mediatomb, to ensure that these applications remain fairly responsive during periods of heavy SMB or rsync usage.  While this sounds expensive, I ordered all of the parts needed from Amazon for slightly under $100.  You could use USB flash drives with similar benefits if you want to go even cheaper.  But I am not always happy with FreeBSD’s performance with USB media, so I decided to go the CF/IDE route.

Recommendations

I’ll put out an update in the next couple of weeks to let you know how things go, along with what I think is a prudent methodology for others to use in selecting their hardware, configuring their OS and zfs, and tuning.

If you feel like building a home server using zfs on low power, relatively wimpy hardware, take 2 aspirin and lie down until the feeling goes away :)!  If you can’t resist, here are my top ten tips:

1. Run zfs on 64-bit capable hardware.
2. Hyper-threading may be a problem with FreeBSD kernels and some chipsets.  The problems are both performance and stability related. Test, Test, Test!!
3. Put as much memory as you can in the box: 2 GB minimum, more than 4 GB recommended.
4. Be careful choosing disks.  Stay away from advanced format disks that don’t honestly report their physical sector size.
5. If you use disks that report a different sector size to the BIOS than their physical one, use gnop to correct it.
6. Use raw disks, do NOT partition them.
7. Implement an L2ARC and ZIL using flash or SSD.
8. Set your vm.kmem_size_max to roughly ½ your total physical memory.
9. Set your vfs.zfs.arc_max to roughly ¾ of the vm.kmem_size.
10. Hit the file system hard, both reads and writes, and monitor your vm.kmem_size and kstat.zfs.misc.arcstats.
11. BONUS TIP:  Perform tests that match your expected usage pattern so that you aren’t surprised, as I was, when performing large transfers.
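Tips 8 and 9 boil down to simple arithmetic, so here is a small Python sketch that turns an amount of physical memory into starting values in /boot/loader.conf syntax.  The function names are made up for illustration, and the ½ and ¾ ratios are just my rules of thumb from the list above, not official guidance.  Treat the output as a starting point for tip 10, not a final answer.

```python
def suggest_tunables(physical_mb):
    """Starting values from the rules of thumb in tips 8 and 9.

    physical_mb: total physical memory in MB.
    Returns sizes as strings with an 'M' suffix, as loader.conf accepts.
    """
    kmem_max_mb = physical_mb // 2      # tip 8: roughly half of RAM
    arc_max_mb = kmem_max_mb * 3 // 4   # tip 9: roughly 3/4 of kmem
    return {
        "vm.kmem_size_max": f"{kmem_max_mb}M",
        "vfs.zfs.arc_max": f"{arc_max_mb}M",
    }


def loader_conf_lines(tunables):
    """Format tunables as name="value" lines for /boot/loader.conf."""
    return [f'{name}="{value}"' for name, value in tunables.items()]


if __name__ == "__main__":
    # Example: a box with 2 GB of RAM
    for line in loader_conf_lines(suggest_tunables(2048)):
        print(line)
```

For a 2 GB box this suggests a 1024M kmem ceiling and a 768M arc ceiling; test under your real workload and adjust from there.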

In closing, thanks to sub.mesa for his concise documentation on FreeBSD and zfs and for his commentary on WD drives.  Thanks to Brendan Gregg for an excellent article on zfs L2ARC.

While I have little doubt that zfs on FreeBSD is the most performant, reliable copy-on-write filesystem available today without spending large sums of money, I am not sure how long this will be true.  I believe that the FreeBSD release cycle and the conservative nature of its maintainers may actually be working against its users’ desires in this case.

The zfs-fuse project seems to have exorcised many of its reliability demons and is now more feature rich, with its implementation of zfs (pool version 23) on Linux.  It is still lacking some on performance, but not by a whole lot.

btrfs appears to be coming along at a pretty fast rate.  Though both are owned/maintained by Oracle, btrfs seems to me to have a life going forward even if Oracle totally shuts down its participation; whereas, zfs’ path past the currently released code seems to be dead outside of Solaris. 

And while these quandaries exist, Microsoft continues to fairly quietly sell Windows Server 2008 with a very tried and tested file system with robust snapshot and performance capabilities. Flame on if you must ye ideologues of ole.  NetApp continues to sell their Filers; and Veritas continues to sell its very expensive solutions. 

I personally believe that it is time for the open source world to put its differences behind it (BSD or Linux, ext or ufs, zfs or btrfs) and pick something that can deliver a robust and performant copy-on-write filesystem with the right features!  Both zfs and btrfs have similar delivery goals but go about things somewhat differently.  In the end, I am an engineer, sigh; I care about what works reliably, what is performant, and what is supportable.

Thanks for sticking with my griping and complaining this far.  I promise I’ll be better next post ;)

lbe

4 comments:

  1. I didn't read your entire post but...from what I have read on the Internet the WD EADS drives don't use the new 4k sector "Advanced Format" technology, it's the EARS models with 64MB cache that use this. Also, from what I've read, any drive with 64MB cache is using the new 4k sector. Now, if I could only get a definitive answer as to whether ESXi 4.1 works with the new 4k sector format off the shelf. There are a lot of postings but no sure answers...

  2. Could you P L E A S E point out how you managed to install Python 2.6 on an embedded FreeNAS installation? And even more precisely, did you manage to run Python 2.6 apps? I ask since SABnzbd and Sick Beard both don't require Python 2.6. They run just fine on the already installed Python 2.5.4. Thanks in advance!

  3. Haha. Nice post. I've been using FreeNAS for a few weeks now and my experience has been very similar to yours. I'm particularly glad that you posted about the perl script that you used to free up inactive memory. I researched that today and improved my throughput 2-3 times what it was when doing large transfers.

    I hear that a lot of these issues have been addressed in FreeBSD 8.2 and ZFS version 15 included in that release. Sadly, even though FreeNAS 0.8 is marked as a release candidate I don't think it's anywhere as mature a product as 0.7 is. Maybe in due time...

  4. Ugh. So much fail in a single post.

    [quote]
    I am currently in the process of implementing L2ARC using a CF card in a CF/IDE adapter...
    [/quote]

    CF cards have a max read speed of 20 MB/s. Hard drives have at least 100 MB/sec read speed. So you want to implement a SLOWER THAN HDD device as L2Arc? LOL
    Your money is better spent buying more RAM.

    [quote]
    Initially FreeNAS was reliable and seemed to be performant, but I had to go and add sabnzbd to download ... favorite TV shows from supernews and add Sick-Beard to handle setting up the downloads. What’s the harm with a couple of little Python programs? It wasn’t that difficult to hack Python 2.6, sabnzbd and Sick-Beard into my embedded version of FreeNAS.
    [/quote]

    Really? This is a NAS, as in Network Attached STORAGE. Making a DEDICATED STORAGE HANDLER do other chores is not smart. At all.

    Also, regarding your comments on how ZFS development is dead, I guess that zfs feature flags, data compression on l2arc and the other recent additions to zfs's capabilities are evidence that it is "dead".

    Seriously, research more.


About Me

Houston, Texas, United States
Geek, sometimes it's biting the head off of a chicken, sometimes it's getting hit in the head while working on something :)
