Here are the slides.
Belatedly, OSCON was great. Meeting the Joyent crew and many of the people involved in OpenSolaris was a real pleasure. The unofficial highlight of the conference was the Sun-sponsored OpenSolaris party. Held in the parking garage of the Doubletree Hotel, it was very, very well attended and generally regarded as the best bash of the conference. (Thanks to Jeff Kubina for his excellent photographic record.) Many thanks to Sara Dornsife for making it all happen. The party would have stopped hours earlier without her diligent oversight. The OpenSolaris Advocacy folks don’t often get enough recognition. Sara, you did great! (Well, except for that whole Hoffman table tennis thing.)
After my presentation I was taken to task by Val Henson, a former Sun employee who worked on ZFS. She thought that:
- I didn’t explain the ZIL well enough, and made it sound like journaling
- My repeated condemnation of using mdb to set vq_max_pending was unwarranted
The ZFS Intent Log maintains a record of system calls that modify a ZFS filesystem plus enough information to recreate those modifications. Records in the ZIL are discarded in a number of circumstances:
- a DMU transaction group completes and is committed to stable storage
- a write flagged O_DSYNC completes
- an fsync() call is completed
- a ZFS filesystem is successfully unmounted
Basically, any completion-to-disk cycle, whether by explicit synchronous request or through the normal persistence of writes to disk, causes the ZIL to be flushed. When a system reboots, the ZIL is examined to see if there are any records in it. The presence of records in the ZIL is evidence that the system crashed before committing all of its outstanding writes to stable storage. The records are replayed, ensuring that the actual on-disk data is consistent with what it should be; the ZIL is then flushed, and the filesystem is mounted. More details on how this actually happens can be found in the source code. Because records are discarded on every commit, the ZIL is only going to contain, in all but the most pathological cases, a few seconds’ worth of transaction data.
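To make the record, discard, and replay cycle concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not how ZFS implements the ZIL: the class `ToyIntentLog` and all of its method names are made up for illustration, and a dictionary stands in for stable storage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIntentLog:
    """Hypothetical toy model, not ZFS code.

    disk    -- state that has reached stable storage
    pending -- in-memory changes not yet committed
    log     -- intent records for those uncommitted changes
    """
    disk: dict = field(default_factory=dict)
    pending: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, key, value):
        # Record enough to recreate the change, then apply it in memory only.
        self.log.append((key, value))
        self.pending[key] = value

    def commit(self):
        # Stand-in for a transaction group reaching stable storage:
        # once the data is on disk, its log records are discarded.
        self.disk.update(self.pending)
        self.pending.clear()
        self.log.clear()

    def crash(self):
        # In-memory state is lost; the disk and the log survive.
        self.pending.clear()

    def mount(self):
        # Leftover records mean a crash happened before commit: replay them,
        # flush the log, and report how many records were replayed.
        replayed = len(self.log)
        for key, value in self.log:
            self.disk[key] = value
        self.log.clear()
        return replayed


fs = ToyIntentLog()
fs.write("a", 1)
fs.commit()        # "a" is on disk, log is empty
fs.write("b", 2)
fs.crash()         # "b" never reached disk, but its record is in the log
print(fs.mount())  # 1
print(fs.disk)     # {'a': 1, 'b': 2}
```

Note that the log never holds more than the writes made since the last commit, which is why it stays small.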
vq_max_pending is a field in the vdev_vqueue structure in ZFS. It is also the reason I maintain that ZFS isn’t ideal for deployment on large storage arrays. vq_max_pending, which has a default value of 35, is the maximum number of I/Os allowed to be queued up against a single leaf vdev. In the case of a leaf vdev being a single spindle, this is an excellent number. In the case of a leaf vdev being a LUN composed of, oh, 40 15K FC disks, this limit is more than a bit low: spread across the LUN, it works out to less than one outstanding I/O per disk.
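The arithmetic behind that complaint is simple enough to sketch. The helper below is hypothetical and assumes queued I/Os spread evenly across the spindles behind a LUN, ignoring array cache and striping details, so treat the result as a rough average rather than a measurement.

```python
def outstanding_per_spindle(max_pending: int, spindles: int) -> float:
    """Average queued I/Os per disk, assuming an even spread (a simplification)."""
    return max_pending / spindles


DEFAULT_MAX_PENDING = 35  # default value of vq_max_pending cited above

single_disk = outstanding_per_spindle(DEFAULT_MAX_PENDING, 1)
big_lun = outstanding_per_spindle(DEFAULT_MAX_PENDING, 40)

print(f"single spindle: {single_disk:.3f}")  # single spindle: 35.000
print(f"40-disk LUN:    {big_lun:.3f}")      # 40-disk LUN:    0.875
```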
It is possible to change the value. (Follow the link. Really. Imagine doing that on 50 servers that each have a few hundred root vdevs. Imagine doing it again every time you add a few LUNs to a server.) The iterative testing required to determine an appropriate value isn’t something that I want to schedule time for whenever I make changes to a server’s storage layout. I also don’t want to have to worry about hand-tuning for optimal performance in every use case. It’s one thing when I’m trying to squeeze out every last bps for one critical database. It’s another thing when I need to worry about individually hand-tuned parameters for every server.
Using mdb this way is conceptually similar to poking values into the Linux /proc filesystem, except for the interface: /proc requires nothing more than echo. While the appropriate mdb twiddling can be scripted to run on boot, it really isn’t quite as simple as the /proc filesystem.
Hopefully this clears up everything that I didn’t explain well enough at OSCON. My slides are here for the taking, though they lack an enormous amount of context.