OpenSolaris? It’s what we say it is, dammit!

My move to San Francisco and my beautiful new daughter haven’t left me much time to blog (or breathe), but I have to point out Stephen Lau’s excellent post. So many of the people in the OpenSolaris community owe their livelihood to Sun that it’s difficult to find opinions that go against the groupthink; most people are, de facto, drinkers of the Sun Kool-Aid. (This is not to say that there’s some evil corporate conspiracy; it’s just people being people and doing things that people do.)

I don’t think there was enough community in the Let This Be OpenSolaris decision.

This is a rather obnoxious bump in the road from Sun as a producer of Solaris to Sun as a consumer of OpenSolaris.

OSCON! ZFS!

Here are the slides.

Belatedly: OSCON was great. Meeting the Joyent crew and many of the people involved in OpenSolaris was a real pleasure. The unofficial highlight of the conference was the Sun-sponsored OpenSolaris party. Held in the parking garage of the Doubletree Hotel, it was very, very well attended and generally regarded as the best bash of the conference. (Thanks to Jeff Kubina for his excellent photographic record.) Many thanks to Sara Dornsife for making it all happen; the party would have stopped hours earlier without her diligent oversight. The OpenSolaris Advocacy folks don’t often get enough recognition. Sara, you did great! (Well, except for that whole Hoffman table tennis thing.)

After my presentation I was taken to task by Val Henson, a former Sun employee who worked on ZFS. She thought that:

  • I didn’t explain the ZIL well enough; I made it sound like journaling
  • My repeated condemnation of using mdb to set vq_max_pending was unwarranted

The ZFS Intent Log maintains a record of system calls that modify a ZFS filesystem plus enough information to recreate those modifications. Records in the ZIL are discarded in a number of circumstances:

  • a DMU transaction group completes and is committed to stable storage
  • a write flagged O_DSYNC completes
  • an fsync() call is completed
  • a ZFS filesystem is successfully unmounted

Basically, any completion-to-disk cycle, whether by explicit synchronous request or through the normal persistence of writes to disk, causes the ZIL to be flushed. When a system reboots, the ZIL is examined to see if there are any records in it. The presence of records in the ZIL is evidence that the system crashed before committing all of its outstanding writes to stable storage. The records are replayed, ensuring that the actual on-disk data is consistent with what it should be; the ZIL is flushed, and the filesystem is mounted. More details on how this actually happens can be found in the source code. Because records are discarded at every commit, the ZIL is only going to contain, in all but the most pathological cases, a few seconds’ worth of transaction data.
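
The record, discard and replay cycle described above can be sketched as a toy model. This is only an illustration of the idea, not how ZFS implements it: the class, its method names, and the assumption that every logged record survives a crash are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFilesystem:
    """Toy intent-log model; hypothetical names, not real ZFS structures."""
    on_disk: dict = field(default_factory=dict)      # state on stable storage
    pending: dict = field(default_factory=dict)      # changes in the open transaction group
    intent_log: list = field(default_factory=list)   # records assumed to survive a crash

    def write(self, name, data):
        # Log enough information to redo the call, then apply it in memory.
        self.intent_log.append(("write", name, data))
        self.pending[name] = data

    def commit_txg(self):
        # The transaction group reaches stable storage, so its records are discarded.
        self.on_disk.update(self.pending)
        self.pending.clear()
        self.intent_log.clear()

    def crash(self):
        # Uncommitted in-memory changes are lost; the log and on-disk state remain.
        self.pending.clear()

    def mount(self):
        # Leftover records mean a crash happened before commit: replay, then flush.
        replayed = len(self.intent_log)
        for op, name, data in self.intent_log:
            if op == "write":
                self.on_disk[name] = data
        self.intent_log.clear()
        return replayed


fs = ToyFilesystem()
fs.write("a.txt", "one")
fs.commit_txg()            # a.txt is on disk and the log is empty again
fs.write("b.txt", "two")
fs.crash()                 # b.txt never reached a committed transaction group
replayed = fs.mount()      # replays the single leftover record
```

The log only ever holds what has been written since the last commit, which is why it stays small.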

vq_max_pending is a field in the vdev_vqueue structure in ZFS. It is also the reason I maintain that ZFS isn’t ideal for deployment on large storage arrays. vq_max_pending, which has a default value of 35, is the maximum number of I/O operations allowed to be queued up against a single leaf vdev. In the case of a leaf vdev being a single spindle, this is an excellent number. In the case of a leaf vdev being a LUN composed of, oh, 40 15K FC disks, this limit is more than a bit low.
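
A rough way to see the problem is to divide the queue limit by the number of disks behind the leaf vdev. The helper below is a hypothetical back-of-the-envelope sketch that assumes queued I/Os spread evenly across the disks in the LUN, which a real array will not guarantee.

```python
def outstanding_per_disk(max_pending: int, disks_behind_lun: int) -> float:
    """Average queued I/Os available to each disk behind one leaf vdev.

    Assumes an even spread across disks (a simplification for illustration).
    """
    if disks_behind_lun <= 0:
        raise ValueError("disks_behind_lun must be positive")
    return max_pending / disks_behind_lun


single_spindle = outstanding_per_disk(35, 1)    # whole queue feeds one disk
forty_disk_lun = outstanding_per_disk(35, 40)   # under one I/O per disk on average
```

With 40 disks behind the LUN, the default leaves less than one outstanding I/O per disk on average, so some spindles will sit idle.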

It is possible to change the value. (Follow the link. Really. Imagine doing that on 50 servers that each have a few hundred root vdevs. Imagine doing it again every time you add a few LUNs to a server). The iterative testing required to determine an appropriate value isn’t something that I want to schedule time for whenever I make changes to a server’s storage layout. I also don’t want to have to worry about hand-tuning for optimal performance in every use case. It’s one thing when I’m trying to squeeze out every last bps for one critical database. It’s another thing when I need to worry about individually hand-tuned parameters for every server.

Using mdb this way is conceptually similar to poking values into the Linux /proc filesystem. Except for the interface. Using /proc requires the use of echo. While the appropriate mdb twiddling can be scripted to run on boot, it really isn’t quite as simple as the /proc filesystem.
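
For comparison, the /proc side of that really is just a file write. The sketch below mimics `echo value > path` from Python; the function names are made up for this example, and the path in the comment (/proc/sys/vm/swappiness, a Linux tunable) is only an example, not anything related to ZFS.

```python
from pathlib import Path


def set_tunable(path, value):
    """Write a value to a file-backed tunable, like `echo value > path`."""
    Path(path).write_text(f"{value}\n")


def get_tunable(path):
    """Read the current value back as a stripped string."""
    return Path(path).read_text().strip()


# Example (needs root on a Linux host, so it is left commented out):
# set_tunable("/proc/sys/vm/swappiness", 10)
```

Nothing comparable exists for the mdb route: there is no file to write to, only a debugger session to script.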

Hopefully this clears up everything that I didn’t explain well enough at OSCON. My slides are here for the taking, though they lack an enormous amount of context.

Top 10 Most Favorite Tweets

People I frequently favorite (I don’t care if you think favorite is a verb or not; I’ll use it as I please) are either writers or should be writers. Funny, insightful, cynical, wise, silly — all squeezed effortlessly into 140 characters. In alphabetical order by author, here are my 10 most favorite favorite tweets. (This is what we call "filler").

  • crystal “I know he thinks you’re fine n stuff, but does he know how to wind you up?” excellent question gwen, one to ask oneself in trying times
  • crystal krissy takes her worries out of her big worry bag and plays with each one frustratedly as if they were small hateful toys
  • Demimundane Doing the booty shake of defiance
  • Demimundane The Interweb will be glad to know that my one-woman interpretive dance of Joyce’s “Ulysses” was the hit of the ball last night. [ed: the mind boggles]
  • GladRagKraken If bitches play Joe Satriani at my wake, I will rise, an unholy specter of wrath and vengeance, to punish responsible parties.
  • monkchips farrell is riding around on his digger saying hello @cote hello cote cote work cote ok
  • Phenobarb Brief nap gives me a second chance to wake up on the wrong side of the bed
  • rebeccashanks In class discussion. The merits of C++. It’s a short topic..
  • hotdogsladies Our Safeway is like The Island of Misfit Toys, but for groceries
  • Yarrow The breeze is full of warmth and mystery today…;) I saw fireflies last night.

Blood of a SysAdmin {I}

Workflow management is one of the most important things for any systems administrator. Hours are long and unpredictable; the job is interrupt-driven; you interface with almost every other group in the organization. Having a consistent system for task management is crucial for working effectively.

My first weapon against chaos is the calendar. Something that isn’t on my calendar or task list has no existence for me unless you shove it in front of my face (which is likely to irritate me unless it’s an emergency). I have multiple calendars in multiple locations.

The obvious question is, “So, Jay, how do you actually manage to keep these things in sync?” Glad you asked. If keeping calendars in sync is difficult, I’m not going to do it. My calendars would become worthless. Work would come to a standstill! The company would come to a grinding halt! Madness! Madness! I’m here to tell you how to cast the madness out of your lives, friends. Embrace the power of OS X! Cling fast to your MacBook! The software you want, the software you need, is here for you! And I’m the man who can lead you to it!

These three applications allow me to enter an appointment or to-do item on any of my calendars. They allow my coworkers to enter appointments on my Exchange calendar. My wife can enter items on my Google calendar or hers. Every point of entry will unobtrusively synchronize with every other calendar.

Microsoft Exchange serves as a way for me to publish and consume free/busy information with the Exchange-using side of the company. Google calendar lets my family and friends see what I’m doing, and it lets my wife and me coordinate our schedules. My crackberry is hip-side information access: event alarms when I’m running around away from the desk and appointment entry while I’m doing the same. It all comes together in iCal, my main interface to the wonderful world of anal-retentive scheduling. Counting the free/busy calendars for coworkers, I have 23 calendars in iCal. I manually enter events into two of them: home and work. The rest are auto-populated by various mechanisms that I’ll talk about in a later post.

Spanning Sync does two-way synchronization between an arbitrary number of calendars on Google and iCal. You can map any calendar on one end to any calendar on the other end. Recently, it’s started supporting event notifications. Missing Sync for Blackberry does exactly what you think. It works well; however, since the BlackBerry has only one calendar, it pushes all events created on the handheld to one specific (yet configurable) calendar in iCal. GroupCal synchronizes Exchange and iCal. It has an excellent system for handling free/busy information, though it can be slightly difficult to use. Always make sure you’ve backed up your iCal database, especially at first: GroupCal has some unintuitive options that can cause iCal to get wiped out.

Total cost for the total seamless calendar? $150. Is it worth it? For me, yes.

Evelyn Izzie

The birth of my first child put a bit of a crimp in the blogging. Weighing 8 pounds and 10 ounces (3.93kg), Izzie was born on June 25th at 9:28PM. She’s wonderful and looks just like her old man sans goatee.

RedMonk Clean Sweep

The chaps at RedMonk sweep the top 50 analyst blogger awards! Congratulations to Cote, James, and Steven!


Not that I claim to have actually looked at the methodology or anything like that, but it’s better than a poke in the eye with a sharp stick…unless the RedMonks resorted to bribery or some other nefarious ploy for their rankings.

Storage, part deaux

“So, you know, we think that should work. Let us know if it does!”


We’ve heard that from every vendor we’re evaluating right now. The smallest discrete "island" of storage that’s being tested for our former Plan A is two clustered T2000s fronting three x4500s.


We export seven six-disk LUNs from each x4500 to the T2000s over 10GbE. I have to admit, it looks nice:


10.0.100.210:/tpool/nfsexp 37T 1.0M 37T 1% /zthumper


(ed: RAIDZ on the x4500s and RAIDZ at the T2000 aggregator. Hot spares and the six OS disks. Raw capacity isn’t called raw for no reason).
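
To show how quickly raw capacity shrinks under two layers of RAIDZ, here is a rough sketch of the arithmetic. The post doesn't give the disk size or the exact pool layout, so the 500 GB disks, single parity at both levels, and one RAIDZ group across all 21 LUNs are assumptions chosen for illustration; the result is not meant to reproduce the df figure above. Filesystem overhead is ignored.

```python
def raw_bytes(servers, disks_per_server, disk_bytes):
    """Total capacity of every disk in every storage server."""
    return servers * disks_per_server * disk_bytes


def usable_bytes(servers, luns_per_server, disks_per_lun, disk_bytes,
                 lun_parity=1, pool_parity=1):
    """Capacity left after parity at the LUN level and at the aggregating pool.

    Assumes a single parity group across all LUNs at the aggregator
    (a hypothetical layout) and ignores filesystem overhead.
    """
    lun_bytes = (disks_per_lun - lun_parity) * disk_bytes
    total_luns = servers * luns_per_server
    return (total_luns - pool_parity) * lun_bytes


DISK = 500 * 10**9  # assumed 500 GB disks; the real size isn't stated

raw = raw_bytes(servers=3, disks_per_server=48, disk_bytes=DISK)
usable = usable_bytes(servers=3, luns_per_server=7, disks_per_lun=6,
                      disk_bytes=DISK)
fraction = usable / raw
```

Under these assumptions roughly 30% of the raw capacity goes to parity, hot spares and OS disks before a single file is stored.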

Unfortunately, there is no perfect solution for implementing a global namespace over an effectively unlimited pool of storage. (At least not that we’ve found — if you’ve got any ideas, please let me know). We’re looking at NFS/ZFS. We’re looking at Polyserve. I’m personally very partial to GPFS. All of these options have their partisans within the company.


What’s interesting is that performance and capacity are relatively far down on the list of items we’re evaluating. The single most important thing we’re looking at is failure scenarios.

  • How is performance degraded when a disk fails?
  • When a thumper fails? A magazine of drives? Two drives?
  • How long does recovery take?
  • How manual is the recovery process?
  • How soon can I get replacement parts on site?
  • How long will the failover server take to fail over?
  • What’s the expected MTTDL? Calculated PFR?
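
For the MTTDL question, a commonly used first-order approximation for a single-parity group is MTBF² / (N × (N − 1) × MTTR). The sketch below applies it with made-up inputs: the one-million-hour MTBF and 24-hour repair time are placeholders, not vendor figures, and the model assumes independent failures with constant rates.

```python
def mttdl_single_parity(disks, mtbf_hours, mttr_hours):
    """Approximate mean time to data loss (hours) for one single-parity group.

    Data is lost when a second disk fails while the first is being rebuilt.
    Assumes independent failures at a constant rate.
    """
    if disks < 2:
        raise ValueError("a parity group needs at least two disks")
    return mtbf_hours ** 2 / (disks * (disks - 1) * mttr_hours)


HOURS_PER_YEAR = 24 * 365

# Hypothetical inputs: a six-disk group, 1,000,000 h MTBF, 24 h to rebuild.
fast_rebuild = mttdl_single_parity(6, 1_000_000, 24)
slow_rebuild = mttdl_single_parity(6, 1_000_000, 72)
fast_rebuild_years = fast_rebuild / HOURS_PER_YEAR
```

The repair time sits in the denominator, which is why the questions about recovery time and parts on site matter as much as the drive specs.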



Ease of administrative workload and flexibility tie for the second and third most important criteria. Cost and performance duke it out for fourth and fifth place.

“Sun, I love you…please don’t beat me anymore!”

We’re rolling out a large product at work. The most critical component of that product is storage. Rock-solid, high-performance, 100% reliable storage in massive quantities. Petabytes of it.

We want ZFS. We want thumpers. We want T2000s. Actually, we wanted those things. Our grandiose dreams have been crushed by the reality of Sun.
