Planet LILUG

December 08, 2014

Josef "Jeff" Sipek

Debugging with mdb

Recently, Theo Schlossnagle posted two interesting articles about debugging on Illumos using mdb. They are “MDB, CTF, DWARF, and other angelic things” and “mdb custom dmods”.

by JeffPC at December 08, 2014 05:41 PM

December 06, 2014

Josef "Jeff" Sipek

Inline Assembly & clang

Recently I talked about inline assembly with GCC and clang where I pointed out that LLVM seems to produce rather silly machine code. In a comment, a reader asked if this was LLVM’s IR doing this or if it was the machine code generator being silly. I was going to reply there, but the reply got long enough to deserve its own post.

I dealt with LLVM’s IR for a couple of months during the fall of 2010. It was both interesting and quite painful.

The IR is at the static single assignment (SSA) level. It assumes that stack space is cheap and infinite. Since it is an SSA form, it has no notion of registers. The optimization passes transform the IR quite a bit and at the end there is very little (if any!) useless code. In other words, I think it is the machine code generation that is responsible for the unnecessary stack frame push and pop. With that said, it is time to experiment.

Using the same test program as before, of course:

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}

Emitting LLVM IR

Let’s compile it with clang passing in the -emit-llvm option to have it generate test.ll file with the LLVM IR:

$ clang -S -emit-llvm -Wall -O2 -m64 test.c

There is a fair amount of “stuff” in the file, but the relevant portions are (line-wrapped by me):

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
  "no-infs-fp-math"="false" "no-nans-fp-math"="false"
  "stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
  "use-soft-float"="false" }

LLVM’s IR happens to be very short and to the point. The function prologue and epilogue are not expressed as part of the IR blob that gets passed to the machine code generator. Note that the function attribute no-frame-pointer-elim is true (meaning frame pointer elimination will not happen).

Now, let’s add in the -fomit-frame-pointer option.

$ clang -S -emit-llvm -Wall -O2 -m64 -fomit-frame-pointer test.c

Now, the relevant IR pieces are:

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
  tail call void asm sideeffect "lock; incl $0",
    "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
  ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
  "no-frame-pointer-elim"="false" "no-infs-fp-math"="false"
  "no-nans-fp-math"="false" "stack-protector-buffer-size"="8"
  "unsafe-fp-math"="false" "use-soft-float"="false" }

The no-frame-pointer-elim attribute changed (from true to false), but the IR of the function itself did not change. (The no-frame-pointer-elim-non-leaf attribute disappeared as well, but it really makes sense since -fomit-frame-pointer is a rather large hammer that just forces frame pointer elimination everywhere and so it doesn’t make sense to differentiate between leaf and non-leaf functions.)

So, to answer Steve’s question, the LLVM IR does not include the function prologue and epilogue. This actually makes a lot of sense given that the IR is architecture independent and the exact details of what the prologue has to do are defined by the ABIs.

IR to Assembly

We can of course use llc to convert the IR into real 64-bit x86 assembly code.

$ llc --march=x86-64 test.ll
$ gas -o test.o --64 test.s

Here is the disassembly for clang invocation without -fomit-frame-pointer:

    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

And here is the disassembly for clang invocation with -fomit-frame-pointer:

    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    


So, it turns out that my previous post simply stumbled across the fact that GCC and clang enable different sets of optimizations at -O2. GCC includes -fomit-frame-pointer by default, while clang does not.

by JeffPC at December 06, 2014 03:51 PM

Working with Wide Characters

Two weekends ago, I happened to stumble into a situation where I had a use for wide characters. Since I’ve never dealt with them before, it was an interesting experience. I’m hoping to document some of my thoughts and discoveries in this post.

As you may have guessed, I am using OpenIndiana for development so excuse me if I happen to stray from straight up POSIX in favor of Illumos-flavored POSIX.

The program I was working with happens to read a bunch of strings. It then does some mangling on these strings — specifically, it (1) converts these strings between Unicode and EBCDIC, and (2) at times it needs to uppercase a Unicode character. (Yes, technically the Unicode to EBCDIC conversion is lossy since EBCDIC doesn’t have all possible Unicode characters. Practically, the program only cares about a subset of Unicode characters and those all appear in EBCDIC.)

In the past, most of the code I wrote dealt with Unicode by just assuming the world was ASCII. This approach allows UTF-8 to just work in most cases. Assuming you don’t want to mangle the strings in any major way, you’ll be just fine. Concatenation (strcat), ASCII character search (strchr), and substring search (strstr) all work perfectly fine. Other functions, however, will do the wrong thing (e.g., strlen will return the number of bytes, not the number of characters).

Converting an ASCII string to EBCDIC is pretty easy. For each input character (aka. each input byte), do a lookup in a 256-element array. The output is just a concatenation of all the looked up values.
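The table lookup described above can be sketched in a few lines. This is only an illustration: the function name translate_bytes is made up, and the 256-entry table is assumed to be supplied by the caller (I am not reproducing a full EBCDIC code page here).

```c
#include <stddef.h>

/*
 * Translate a NUL-terminated byte string through a 256-entry lookup table.
 * `table` is supplied by the caller (hypothetical; e.g. an ASCII-to-EBCDIC
 * map). `out` must have room for every input byte.
 * Returns the number of bytes written.
 */
size_t translate_bytes(const unsigned char table[256], const char *in,
    unsigned char *out)
{
	size_t n = 0;

	/* one lookup per input byte; the cast avoids negative indices */
	for (; in[n] != '\0'; n++)
		out[n] = table[(unsigned char)in[n]];

	return n;
}
```

With a table that maps 'A' to 0xC1 (its usual EBCDIC value), the input "A" comes out as the single byte 0xC1.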

This simple approach falls apart if the input is UTF-8. There, some characters (e.g., ö) take up multiple bytes (e.g., c3 b6). Iterating over the input bytes won’t work. One way to deal with this is to process as many bytes as necessary to get a full character (1 for ASCII characters, 2–4 for “non-ASCII” Unicode characters), and then convert/uppercase/whatever it instead of the raw bytes. This sort of hoop-jumping is necessary whenever one wants to process characters instead of bytes.
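To give a feel for the hoop-jumping, here is a rough sketch of figuring out how many bytes make up each character. It assumes the modern four-byte limit on UTF-8 sequences (RFC 3629), the function names are invented for this example, and it deliberately skips validating the continuation bytes.

```c
#include <stddef.h>

/*
 * Length in bytes of the UTF-8 sequence introduced by `lead`, or 0 if
 * `lead` cannot start a sequence (continuation byte or invalid value).
 */
size_t utf8_seq_len(unsigned char lead)
{
	if (lead < 0x80)
		return 1;	/* 0xxxxxxx: plain ASCII */
	if ((lead & 0xE0) == 0xC0)
		return 2;	/* 110xxxxx */
	if ((lead & 0xF0) == 0xE0)
		return 3;	/* 1110xxxx */
	if ((lead & 0xF8) == 0xF0)
		return 4;	/* 11110xxx */
	return 0;
}

/*
 * Count characters (not bytes). Returns (size_t)-1 on a bad lead byte or
 * a sequence cut short by the terminator.
 */
size_t utf8_char_count(const char *s)
{
	size_t count = 0;

	while (*s != '\0') {
		size_t len = utf8_seq_len((unsigned char)*s);
		size_t i;

		if (len == 0)
			return (size_t)-1;
		for (i = 1; i < len; i++)
			if (s[i] == '\0')
				return (size_t)-1;
		s += len;
		count++;
	}
	return count;
}
```

For the ö example above (bytes c3 b6), strlen reports 2 while utf8_char_count reports 1.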


Another way to deal with this is to store the string as something other than UTF-8. I took this approach. When the program reads in a (UTF-8) string, it promptly converts it into a wide character string. In other words, instead of my strings being char *, they are wchar_t *. On my system, wchar_t is a 32-bit unsigned integer. This trivially makes all Unicode characters the same length — 32 bits. I can go back to assuming that one element of my string corresponds to a single character. I just need to keep in mind that a single character is not one byte. In practice, this means remembering to malloc more memory than before. In other words:

wchar_t *a, *b;

a = malloc(MAX_LEN);                   /* WRONG */
b = malloc(sizeof(wchar_t) * MAX_LEN); /* CORRECT */
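The conversion step itself (multibyte string in, wide character string out) might look something like the following. This is a sketch, not the code from my program: to_wide is a made-up name, and it relies on the POSIX behavior of mbstowcs returning the required length when the destination is NULL. For UTF-8 input, the locale must already be set to a UTF-8 one.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert a multibyte string (UTF-8 when the locale says so) into a newly
 * allocated wide character string. Returns NULL on an invalid sequence or
 * allocation failure; the caller frees the result.
 */
wchar_t *to_wide(const char *s)
{
	/* number of wide characters needed, not counting the terminator */
	size_t len = mbstowcs(NULL, s, 0);
	wchar_t *w;

	if (len == (size_t)-1)
		return NULL;

	/* size the buffer in wchar_t elements, not bytes */
	w = malloc(sizeof(wchar_t) * (len + 1));
	if (w == NULL)
		return NULL;

	mbstowcs(w, s, len + 1);
	return w;
}
```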

Uppercasing a character becomes just as easy as it was with plain ol’ ASCII. For example, to uppercase the $i^{th}$ letter in a string:

void uppercase_nth(wchar_t *str, int i)
{
	str[i] = towupper(str[i]); /* wide-character toupper, from <wctype.h> */
}

There are, however, some downsides. First and foremost, if you are dealing mostly with ASCII, then your memory footprint may have just quadrupled. (In my case, the program is so small that I don’t care about the memory footprint increase.) Second, you have to deal with a couple of “silly” bits of syntax to make the (C99) compiler realize what it is you are attempting to do.

const wchar_t *msg = L"Lorem ipsum";
const wchar_t letter = L'x';

“str” functions

Arguably, the most visible change involves the “str” functions. With plain old ASCII strings, you use functions like strlen, strcpy, and strcat to, respectively, get the length, copy a string, and concatenate two strings. These functions assume that each byte is a character and that the string is terminated by a null (8-bit 0), so they do not work in the world of wide characters. (Keep in mind that since ASCII consists of characters with values less than 128, a 32-bit integer holding such a value contains three null bytes. Hand a wide character string of ASCII text to a “str” function and, on big endian systems, you’ll end up with the empty string, while on little endian systems you’ll end up with a string consisting of just the first character.) Thankfully, there are alternatives to the “str” functions that know how to deal with wide character strings: the “ws” functions. Instead of using strlen, strcpy, and strcat, you want to call wslen, wscpy, and wscat. (The portable ISO C spellings are wcslen, wcscpy, and wcscat.) There are of course more. On Illumos, you can look at the wcstring(3c) manpage for many (but not all!) of them.
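Here is a small sketch of both halves of that paragraph: what a byte-oriented function makes of a wide string, and the wide equivalents doing the right thing. It uses the ISO C names (wcslen, wcscpy, wcscat) from <wchar.h> so that it builds outside of Illumos too; naive_strlen and wide_join are names invented for the example.

```c
#include <string.h>
#include <wchar.h>

/* What a byte-oriented function sees when handed a wide string. */
size_t naive_strlen(const wchar_t *ws)
{
	return strlen((const char *)ws);
}

/*
 * Concatenate a and b into dst, whose capacity n is counted in wchar_t
 * elements. Returns 0 on success, -1 if the result would not fit.
 */
int wide_join(wchar_t *dst, size_t n, const wchar_t *a, const wchar_t *b)
{
	if (wcslen(a) + wcslen(b) + 1 > n)
		return -1;

	wcscpy(dst, a);
	wcscat(dst, b);
	return 0;
}
```

For ASCII text, naive_strlen never gets past the first character: it returns 0 or 1 depending on endianness, no matter how long the wide string is.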

printf & friends

Manipulating strings solely with the “str” functions is tedious. Often enough, it is so much simpler to reach for the venerable printf. This is where things get really interesting. The printf family of functions knows how to convert between char * strings and wchar_t * strings. First of all, let’s take a look at snprintf (the same applies to printf and sprintf). Here’s a simple code snippet that dumps a string into a char array. The output is char *, the format string is char *, and the string input is also char *.

char output[1024];
char *s = "abc";

snprintf(output, sizeof(output), "foo %s bar\n", s);

One can use %ls to let snprintf know that the corresponding input string is a wide character string. snprintf will do everything the same, except it transparently converts the wide character string into a regular string before outputting it. For example:

char output[1024];
wchar_t *s = L"abc";

snprintf(output, sizeof(output), "foo %ls bar\n", s);

This will produce the same output as the previous code snippet.

Now, what if you want the output to be a wide character string? Simple, use the wprintf functions! There are fwprintf, wprintf, and swprintf which correspond to fprintf, printf, and snprintf. Do note that the wide-character versions want the format string to be a wide character string. As far as the format string is concerned, the same rules apply as before — %s for char * input and %ls for wchar_t * input:

wchar_t output[1024];
wchar_t *s1 = L"abc";
char *s2 = "abc";

/* the size argument counts wchar_t elements, not bytes */
swprintf(output, sizeof(output) / sizeof(output[0]), L"foo %ls %s bar\n", s1, s2);

Caution! In addition to swprintf there is also wsprintf. This one takes the format string in char * but outputs into a wchar_t * buffer.

Here’s the same information, in a tabular form. The input string type is always determined by the format string contents — %s for char * input and %ls for wchar_t * input:

Function                            Output      Format string
printf, sprintf, snprintf, fprintf  char *      char *
wprintf, swprintf, fwprintf         wchar_t *   wchar_t *
wsprintf                            wchar_t *   char *

setlocale and Summary

Oh, I almost forgot! You should call setlocale before you start using all these features.
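Putting the pieces together, a minimal sketch might look like this. The helper names (init_locale, format_wide, format_narrow) are invented for the example; setlocale, swprintf, and snprintf are the standard C functions. Passing "" to setlocale picks up the locale from the environment, which I am assuming is what you want.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/*
 * Adopt the locale from the environment (LANG, LC_ALL, ...).
 * Returns 0 on success, -1 if the requested locale is unavailable.
 */
int init_locale(void)
{
	return setlocale(LC_ALL, "") != NULL ? 0 : -1;
}

/* Wide output: the size argument is counted in wchar_t elements. */
int format_wide(wchar_t *out, size_t n, const wchar_t *ws, const char *s)
{
	return swprintf(out, n, L"foo %ls %s bar\n", ws, s);
}

/* Narrow output from a wide input via %ls. */
int format_narrow(char *out, size_t n, const wchar_t *ws)
{
	return snprintf(out, n, "foo %ls bar\n", ws);
}
```

Call init_locale once at the top of main, before any of the conversions. Without it the program stays in the "C" locale, where only ASCII is guaranteed to convert.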

So, to conclude, it is pretty easy to use wide character strings.

  • #include <wchar.h>
  • #include <widec.h>
  • call setlocale in your main
  • use wchar_t instead of char
  • use %ls in format strings instead of %s
  • use L string literal prefix
  • beware of wsprintf and swprintf

I wouldn’t want to deal with this sort of code on a daily basis, but for a random side project it isn’t so bad. I do like the ability to not worry about the encoding; the 1:1 mapping of characters to array elements is really convenient.

by JeffPC at December 06, 2014 01:40 PM

December 01, 2014

Josef "Jeff" Sipek

Delegating mount/umount Privileges

Recently, I was making some file system changes. Obviously, I wanted to test them as an unprivileged user. Unfortunately, the test involved mounting and unmounting a filesystem (tmpfs to be specific). At first I was going to set up a sudo rule to allow mount and umount to run without asking for a password. Then I remembered that I should be able to give the unprivileged user the additional privileges. It turns out that there is only one privilege (sys_mount) necessary to delegate…and it is easy to do!

$ usermod -K defaultpriv=basic,sys_mount jeffpc

Then it’s a matter of logging out and back in. We can check using ppriv:

$ ppriv $$
925:    bash
flags = <none>
        E: basic,sys_mount
        I: basic,sys_mount
        P: basic,sys_mount
        L: all

At this point, mounting and unmounting works without sudo or similar user switching:

$ mkdir tmp
$ mount -F tmpfs none /tmp/tmp
$ df -h /tmp/tmp
Filesystem      Size  Used Avail Use% Mounted on
swap            2.6G     0  2.6G   0% /tmp/tmp

by JeffPC at December 01, 2014 06:29 PM

November 27, 2014

Josef "Jeff" Sipek

Inline Assembly & GCC, clang

Recently, I got to write a bit of inline assembly. In the process I got to test my changes by making a small C file which defined a test function that called the inline function from the header. Then, I could look at the disassembly to verify all was well.

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
	atomic_inc_32(x);
}

GCC has been my go-to compiler for a long time now. So, at first I was using it to debug my inline assembly. I compiled the test programs using:

$ gcc -Wall -O2 -m64 -c test.c

Disassembling the object file yields the rather obvious:

    test:     f0 ff 07           lock incl (%rdi)
    test+0x3: c3                 ret    

I can’t think of any way to make it better :)

Then, at some point I remembered that Clang/LLVM are pretty good as well. I compiled the same file with clang:

$ clang -Wall -O2 -m64 -c test.c

The result was rather disappointing:

    test:     55                 pushq  %rbp
    test+0x1: 48 89 e5           movq   %rsp,%rbp
    test+0x4: f0 ff 07           lock incl (%rdi)
    test+0x7: 5d                 popq   %rbp
    test+0x8: c3                 ret    

For whatever reason, Clang feels the need to push/pop the frame pointer. I did a little bit of searching, and I couldn’t find a way to disable this behavior.

The story for 32-bit output is very similar (just drop the -m64 from the compiler invocation). GCC produced the superior output:

    test:     8b 44 24 04        movl   0x4(%esp),%eax
    test+0x4: f0 ff 00           lock incl (%eax)
    test+0x7: c3                 ret    

Clang, meanwhile, still wanted to muck around with the frame pointer:

    test:     55                 pushl  %ebp
    test+0x1: 89 e5              movl   %esp,%ebp
    test+0x3: 8b 45 08           movl   0x8(%ebp),%eax
    test+0x6: f0 ff 00           lock incl (%eax)
    test+0x9: 5d                 popl   %ebp
    test+0xa: c3                 ret    

For the curious ones, I’m using GCC 4.8.3 and Clang 3.4.2.

I realize this is a bit of a special case (how often do you make a function that simply calls an inline function?), but it makes me worried about what sort of sub-optimal code Clang produces in other cases.

by JeffPC at November 27, 2014 01:05 AM

November 15, 2014

Nate Berry

Using i3 on my Arch USB flash drive

Back in February I wrote about setting up a bootable USB stick with Arch Linux. At the time I was using it with a Dell laptop, but since then have been running it mainly off an old Thinkpad T410s (with a now totally non-functional power cable and a cracked palmrest) that had been retired from […]

by Nate at November 15, 2014 07:08 PM

October 03, 2014

dorgan Ruins Halloween!!

Steer clear of   On July 13, 2014 my mother placed an order for the Anna/Elsa dresses that were on pre-order as she wanted to get the Anna/Elsa dresses for my twin daughters (age 2) so we directed her there as the reviews of the company were great and it pricing was great.  We all understood that this was a pre-order item and would not be receiving it right away.  Some time went by and I wanted to follow up with so I asked my mom to forward me the confirmation email so that I could get the order number and follow up with them. So she did and I sent the following email on 8/25/14 to follow up on the order:

I am trying to follow up on the order status for:  myfancyprincess-xxxxx

I know the order was for pre-order, so I just wanted to follow up on the status for the actual items.

On the same day I received the following response:
According to the date you ordered, you are in our third presale shipment which is not due to arrive here at our location until the end of August/early September.  Once we receive the shipment and check it in, then we will ship in the order received.  If there are no further delays you should see a shipping confirmation somewhere around mid to third week of September. 

Thank you for your business and your continued patience.  We sincerely appreciate it!

Excellent, a quick reply and an approximate date of when to expect the dresses.  So some more time goes by and we hadn't received the dresses yet, so I called my mom and asked her if she had heard anything, and she had not.  So on September 23, 2014 I sent a quick email:

I just wanted to follow up on this order as we still have not received anything and it is now the end of September.

Another 5 days went by without any type of response, so on 9/26/14 I sent another email as their phone system says the way to get the  quick response is to email them:
I have not received my order not a response from you this week, not quite sure what is going on.
On 10/01/14 still no response, so at this point we call and leave a voice message, as well as send another email:
I am sending yet another email to follow up on this order.  Please its getting close to halloween and it was our understanding that we would have these items by now.....

There was no response for most of the day on 10/02/14, so I sent the following email. Granted, it's definitely confrontational, but all I am looking for is a status update:
So yet another week has gone by.  Both my wife and I have called and left messages as well as sent email and NOTHING has been responded to.  This is totally UNACCEPTABLE and I will start a social media campaign soon if I don't hear back.

Also being slightly disgruntled at this point, I figured I would try another contact medium, Facebook.  So I posted a message along the same lines as my previous emails. (Which has now been deleted).  That seemed to get their attention and I received the following email reply.
Any details regarding the order are released to the purchaser only.  We see we previously responded to an e-mail but that was a mistake.  Thank you.
To which I responded:

Ok please email the purchaser (my mom) with an update and I will contact her to get the details.
And they comment on my Facebook feed also that they have responded to my email as well as forwarded the information along to my Mom, great an update, we are happy.  My mom then tells me that the email states the dresses will not be shipping for another 1-2 weeks and then we'll receive them 5 days after that.  They also explained that this is not their fault and that it is the fault of their manufacturer/distributor.  So hey what are you going to do, so we just have to wait.  Well it seems that they didn't like some of the negative comments that some of my friends/family put in the thread that I had started with them.  So they deleted the post.  And sent the following email to my mom, the original purchaser:

We have gone ahead and canceled this order.  Order delays from our supplier are not our fault and we will not continue to be bashed publicly for something that is not out fault.  we just spoke to our supplier yesterday and they are the ones delaying, NOT us.  We have explained this just this morning to your husband (I think they meant son) who also tried to publicly shame us for this.  We explained it respectfully and nicely.  Yet you felt the need to once again publicly bash us for what we already explained was not our fault.  We are just as upset over this as you are and have on more then one occasion expressed out disappointment that we are the ones taking al the blame for the delays that are not our fault.  We also gave you an option to switch to other in stock dresses (Double the cost) and instead of e-mailing us to work something out you once again went on our page to publicly express your disappointment  (My mom, the actual purchaser, never posted on the page, my wife did when she saw they deleted my post to their page).  You have every right to be disappointed, but please understand that we did not cause this.  You have been refunded in full and the order is now cancelled.

Well that got me really pissed as they could have used the opportunity to shine in a customer service issue, and they chose not to.  So I looked back at some of their Facebook posts to see if anyone else was complaining on their Facebook page and found a recent one within the last day or two and commented on that posting stating to be careful what they post as if they find it "offensive" or that it is "bashing" them they would cancel your order.  Since that comment their Facebook page is now completely locked down, no commenting, no liking and no posting.

Ultimately they should have been sending status updates on these pre-order items; that's the right thing to do.  They also could have used the Facebook posts to shine in customer service but chose to hide everything in email.

I am sorry but if you are on social media you must take the good and the bad with it.  You can't just delete/hide everything that you don't like you have to use it as a tool to show everyone else how you can treat the customer with respect.

In the end who suffers, my 2 year old twin daughters, as we are really close to Halloween and no one else has these costumes in their size.  My wife just told me that she let my daughter Sarah know that she might have to be something else for Halloween and she started to cry.

Shame on you!!

by Donald Organ at October 03, 2014 07:21 PM

September 03, 2014

Eitan Adler

Finding the majority element in a stream of numbers

Some time ago I came across the following question.
As input a finite stream of numbers is provided. Define an algorithm to find the majority element of the input. The algorithm need not provide a sensible result if no majority element exists. You may assume a transdichotomous memory model.
There are a few definitions which may not be immediately clear:
Stream
A possibly infinite set of data which may not be reused in either the forward or backward direction without explicitly storing it.
Majority element
An element in a set which occurs more than half the time.
Transdichotomous memory model
The integer size is equal to the word size of memory. One does not need to worry about storing partial pieces of integers in separate memory units.
Unfortunately this answer isn't of my own invention, but it is interesting and succinct.

The algorithm uses 3 registers: the accumulator, the guess, and the current element (next):
  1. Initialize accumulator to 0
  2. Accept the next element of the stream and place it into next. If there are no more elements go to step #7.
  3. If accumulator is 0 place next into guess and increment accumulator.
  4. Else if guess matches next increment accumulator
  5. Else decrement accumulator
  6. Go to step 2
  7. Return the value in guess as the result
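The steps above translate almost line for line into code. The sketch below is my own rendering, with the stream modeled as an array purely for convenience (an assumption; a real stream would be read one element at a time) and majority_element as a made-up name. The variable names follow the registers in the description.

```c
#include <stddef.h>

/*
 * Returns the majority element of nums[0..n-1]. If no element occurs more
 * than n/2 times, the return value is not meaningful.
 */
int majority_element(const int *nums, size_t n)
{
	size_t accumulator = 0;		/* step 1 */
	int guess = 0;
	size_t i;

	for (i = 0; i < n; i++) {	/* steps 2 and 6 */
		int next = nums[i];

		if (accumulator == 0) {		/* step 3 */
			guess = next;
			accumulator++;
		} else if (guess == next) {	/* step 4 */
			accumulator++;
		} else {			/* step 5 */
			accumulator--;
		}
	}

	return guess;			/* step 7 */
}
```

Only the accumulator, the guess, and the current element are kept, so the memory use is constant regardless of the stream length.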

An interesting property of this algorithm is that it can be implemented in $O(n)$ time even on a single tape Turing Machine.

by Eitan Adler at September 03, 2014 12:56 AM

August 13, 2014

Josef "Jeff" Sipek

Serial Console in a Zone

In the past, I’ve talked about serial consoles. I have described how to set up a serial console on Solaris/OpenIndiana. I’ve talked about Grub’s composite console in Illumos-based distros. This time, I’m going to describe the one trick necessary to get tip(1) in a zone working.

In my case, I am using SmartOS to run my zones. Sadly, SmartOS doesn’t support device pass-through of this sort, so I have to tweak the zone config after I create the zone with vmadm.

Let’s assume that the serial port I want to pass through is /dev/term/a. Passing it through into a zone is as easy as:

[root@isis ~]# zonecfg -z 7cff99f6-2b01-464d-9f72-d0ef16ce48af
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> add device
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> set match=/dev/term/a
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> end
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> commit

At this point, you’ll probably want to reboot the zone (I don’t remember if it is strictly necessary). Once it is back up, you’ll want to get into the zone and point your software of choice at /dev/term/a. It doesn’t matter that you are in a zone. The same configuration rules apply — in my case, it’s the same change to /etc/remote as I described previously.

by JeffPC at August 13, 2014 08:54 PM

Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages: atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is that inlining them not only improved kernel performance on the benchmarks by about 1%, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.


It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

	incq	(%rdi)

The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros to make writing assembly functions semi-sane. Anyway, this code is used by both the kernel as well as userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.

These assembly functions get mangled by the C preprocessor, and then are fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions, krtld (the linker-loader) wires up those references to this code.


Replacing these functions with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However, doing so would remove the actual functions (as well as the symbol table entries) and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go.

Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

#include "add.h"

int add(int x, int y)
{
	return x + y;
}

In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

extern int add(int, int);

Then, we attempt to call it from a function in, say, test.c:

#include "add.h"

int test(void)
{
	return add(5, 7);
}

Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

    test:     be 07 00 00 00     movl   $0x7,%esi
    test+0x5: bf 05 00 00 00     movl   $0x5,%edi
    test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>

And the binary contains both functions:

$ /usr/bin/nm | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test

Now suppose that we modify the header file to include the following (assuming GCC’s inline definition):

extern int add(int, int);

extern inline int add(int a, int b)
{
	return a + b;
}

If we compile and link the same .so the same way, that is we feed in the object file with the previously used implementation of add, we’ll get a slightly different binary. The invocation of add will use the inlined version:

    test:     b8 0c 00 00 00     movl   $0xc,%eax
    test+0x5: c3                 ret    

But the binary will still include the symbol:

$ /usr/bin/nm | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test

Neat, eh?

extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.
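To make the pattern a bit more concrete, here is a rough sketch of what such an extern inline definition could look like for a 32-bit increment. It is not a copy of the Illumos header: my_atomic_inc_32 is a made-up name, the gnu_inline attribute is there to request the GNU inline semantics regardless of the C dialect in use, and always_inline plus the non-x86 fallback are additions of mine so that the example builds on its own.

```c
#include <stdint.h>

/* Out-of-line prototype, the kind of thing the common header declares. */
extern void my_atomic_inc_32(volatile uint32_t *target);

/*
 * GNU-style "extern inline": this body is used only for inlining and is
 * never emitted as a symbol of its own. In the real kernel the symbol
 * comes from the assembly file; here always_inline guarantees that no
 * out-of-line copy is ever needed, even without optimization.
 */
extern __inline__ __attribute__((__gnu_inline__, __always_inline__)) void
my_atomic_inc_32(volatile uint32_t *target)
{
#if defined(__i386__) || defined(__x86_64__)
	__asm__ __volatile__("lock; incl %0" : "+m" (*target) : : "cc");
#else
	/* fallback (my addition) so the sketch still builds on non-x86 */
	(void) __atomic_fetch_add(target, 1, __ATOMIC_SEQ_CST);
#endif
}
```

A caller that includes such a header gets the lock incl emitted right at the call site, while modules compiled without the inline version keep calling the regular symbol.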

Phew! This ended up being longer than I expected. :)

by JeffPC at August 13, 2014 06:57 PM

August 12, 2014

Josef "Jeff" Sipek

Grub Composite Console

In the past, I’ve described how to get a serial console going on Illumos-based systems. If you ever used a serial console in Grub (regardless of the OS you ended up booting), you probably know that telling Grub to output to a serial port causes the VGA console to become totally useless: it’s blank.

Well, if you are using Illumos, you are in luck. About 5 months ago, Joyent integrated a “composite console” in Grub. You can read the full description in the bug report/feature request. The short version is: all grub output can be sent to both the VGA console as well as over a serial port.

It is very easy to configure. In your menu.lst, change the terminal to composite. For example, this comes from my test box’s config file (omitting the uninteresting bits):

serial --unit=0 --speed=115200
terminal composite

Note the use of composite instead of serial. That’s all there is to it.

by JeffPC at August 12, 2014 03:56 PM

August 06, 2014

Josef "Jeff" Sipek

Operating Systems: Three Easy Pieces

I just found out that Remzi and Andrea decided to write a textbook about operating systems. This is exciting for several reasons. Here are the top two.

First and foremost, the book is free. That’s right, a textbook that is free when every other computer science textbook is easily around $100. Why? I’ll let Remzi make the case. Long story short, publishing a textbook isn’t about making money. It is about sharing ideas. You can download it from the textbook’s website.

Second, the book is by Remzi and Andrea. This pair of professors from the University of Wisconsin is responsible for a ton of amazing storage related research. If you don’t believe me, check out their publication track record.

I suppose I should mention that I have read only very little of the book, but I did push it onto the top of my to-read stack and I’m slowly making my way through it. I’ll let you all know how it goes.

by JeffPC at August 06, 2014 07:12 PM

August 04, 2014

Josef "Jeff" Sipek

Lua Compatibility

Phew! Yesterday afternoon, I decided to upgrade my laptop’s OpenIndiana from 151a9 to “Hipster”. I did it in a somewhat convoluted way, and hopefully I’ll write about that some other day. In the end, I ended up with a fresh install of the OS with X11 and Gnome. If you’ve ever seen my monitors, you know that I do not use Gnome; I use Notion. So, of course I had to install it. Sadly, OpenIndiana doesn’t ship it, so it was up to me to compile it. After the usual fight to get a piece of software to compile on Illumos (a number of the Solaris-isms are still visible), I got it installed. A quick gdm login later, Notion threw me into a minimal environment because something was exploding.

After far too many hours of fighting it, searching online, and trying random things, I concluded that it was not Notion’s fault. Rather, it was something on the system. Eventually, I figured it out. Lua 5.2 (which is standard on Hipster) is not compatible with Lua 5.1 (which is standard on 151a9)! Specifically, a number of functions have been removed and the behavior of other functions changed. Not being a Lua expert (I just deal with it whenever I need to change my window manager’s configuration), it took longer than it should have, but eventually I got Notion working the way it should.

So, what sort of incompatibilities did I have to work around?


loadstring got renamed to load. This is easy to fix, but still a headache, especially if you want to support multiple versions of Lua.


table.maxn got removed. This function returned the largest positive integer key in an associative array (aka. a table) or 0 if there aren’t any. (Lua indexes arrays starting at 1.) The developers decided that it’s so simple that those that want it can write it themselves. Here’s my version:

local function table_maxn(t)
    local mn = 0
    for k, v in pairs(t) do
        if mn < k then
            mn = k
        end
    end
    return mn
end


table.insert now checks bounds. There doesn’t appear to be any specific way to get the old behavior. In my case, I was lucky. The index/position for the insertion was one higher than what table_maxn returned. So, I could replace:

table.insert(ret, pos, newscreen)

with:

ret[pos] = newscreen

Final Thoughts

I can understand wanting to deprecate old crufty interfaces, but I’m not sure that the Lua developers did it right. I really think they should have marked those interfaces as obsolete, made any use spit out a warning, and then removed them a couple of years later. I think that not doing this will hurt Lua 5.2’s adoption.

Yes, I believe there is some sort of a compile time option for Lua to get legacy interfaces, but not everyone wants to recompile Lua because the system installed version wasn’t compiled quite the way that would make things Just Work™.

by JeffPC at August 04, 2014 01:42 PM

August 02, 2014

Josef "Jeff" Sipek

Generating Random Data

Over the years, there have been occasions when I needed to generate random data to feed into whatever system. At times, simply using /dev/random or /dev/urandom was sufficient. At other times, I needed to generate random data at a rate that exceeded what I could get out of /dev/random. This morning, I read Chris’s blog entry about his need for generating lots of random data. I decided that I should write up my favorite approach so that others can benefit.

The approach is very simple. There are two phases. First, we set up our own random pool. Second we use the random pool. I am going to use an example throughout the rest of this post. Suppose that we want to make repeated 128 kB writes to a block device and we want the data to be random so that the device can’t do anything clever (e.g., compress or dedup). Say that during this benchmark we want to write out 64 GB total. (In other words, we will issue 524288 writes.)

Setup Phase

During the setup phase, we create a pool of random data. The easiest way is to just read /dev/urandom. Here, we want to read enough data so that the pool is large enough. For our 128 kB write example, we’d want at least 1 MB. (I’ll explain the sizing later. I would probably go with something like 8 MB because unless I’m on some sort of limited system, the extra 7 MB of RAM won’t be missed.)
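
A minimal sketch of the setup phase, assuming a POSIX system with /dev/urandom and the 8 MB pool size suggested above. The pool_init name and the bare-bones error handling are assumptions made for illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define POOL_SIZE	(8 * 1024 * 1024)

static char pool[POOL_SIZE];

/*
 * Fill the pool from /dev/urandom.  Returns 0 on success, -1 on failure.
 * read(2) may return fewer bytes than requested, so loop until the pool
 * is full.
 */
int pool_init(void)
{
	size_t done = 0;
	int fd;

	fd = open("/dev/urandom", O_RDONLY);
	if (fd < 0)
		return -1;

	while (done < POOL_SIZE) {
		ssize_t ret = read(fd, pool + done, POOL_SIZE - done);

		if (ret <= 0) {
			close(fd);
			return -1;
		}

		done += ret;
	}

	close(fd);
	return 0;
}
```

This only has to run once, before the benchmark starts, so its speed does not matter much.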

“Using the Pool” Phase

Now that we have the pool, we can use it to generate random buffers. The basic idea is to pick a random offset into the pool and just grab the bytes starting at that location. In our example, we’d pick a random offset between zero and pool size minus 128 kB, and use the 128 kB at that offset.

In pseudo code:

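#include <stdlib.h>

#define BUF_SIZE	(128 * 1024)
#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

/* return a pointer to BUF_SIZE bytes at a random offset into the pool */
char *generate(void)
{
	return &pool[rand() % (POOL_SIZE - BUF_SIZE)];
}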

That’s it! You can of course make it more general and let the caller tell you how many bytes they want:

#include <stdlib.h>

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

/* len must be smaller than POOL_SIZE */
char *generate(size_t len)
{
	return &pool[rand() % (POOL_SIZE - len)];
}

It takes a pretty simple argument to show that even a modest sized pool will be able to generate lots of different random buffers. Let’s say we’re dealing with the 128 kB buffer and 1 MB pool case. The pool can return 128 kB starting at offset 0, or offset 1, or offset 2, … or offset 917503 ($1MB - 128kB - 1B$). This means that there are 917504 possible outputs. Recall that in our example we were planning on writing out 64 GB in total, which was 524288 writes.

$\frac{524288}{917504} = 0.571$

In other words, we are planning on using less than 58% of the possible outputs from our 1 MB pool to write out 64 GB of random data! (An 8 MB pool would yield 6.3% usage.)
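
The arithmetic is easy to double check with a few lines of C. This is only a sketch; the helper names pool_outputs and pool_usage are made up for illustration.

```c
#include <stdint.h>

/* Number of distinct offsets a buffer of buf bytes can start at. */
uint64_t pool_outputs(uint64_t pool, uint64_t buf)
{
	return pool - buf;
}

/*
 * Fraction of the possible outputs consumed when writing total bytes
 * in buf-sized writes.
 */
double pool_usage(uint64_t pool, uint64_t buf, uint64_t total)
{
	uint64_t writes = total / buf;

	return (double)writes / (double)pool_outputs(pool, buf);
}
```

Calling pool_usage with a 1 MB pool, 128 kB buffers, and 64 GB total gives roughly 0.571, and with an 8 MB pool roughly 0.063, matching the numbers above.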

If the length is variable, the math gets more complicated, but in a way we get even better results (i.e., lower usage) because to generate the same buffer we would need to have the same offset and length. If the caller supplies random (pseudo-random or based on some distribution) lengths, we’re very unlikely to get the same buffer out of the pool.


Some of you may have noticed that we traded generating 128 kB (or a user supplied length) of random data for generating a random integer. There are two options there: either you can use a fast pseudo-random number generator (like the Mersenne twister), or you can reuse the same pool! In other words:

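#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];
static size_t *ridx = (size_t *) pool;

char *generate(size_t len)
{
	/* wrap around once every size_t in the pool has been consumed */
	if ((uintptr_t) ridx == (uintptr_t) &pool[POOL_SIZE])
		ridx = (size_t *) pool;

	/* use the next size_t from the pool as the random offset */
	return &pool[*ridx++ % (POOL_SIZE - len)];
}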

I leave it as an exercise for the reader to either make it multi-thread safe, or to have the caller pass in the index, similarly to how rand_r takes an argument.
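
For the curious, here is one possible shape for the second option. It is a sketch, not the one true answer: generate_r and its byte-offset cursor are names and choices made up for illustration, and it assumes the pool is filled once and only read afterwards.

```c
#include <stddef.h>
#include <string.h>

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

/*
 * Like generate(), but the caller owns the cursor (*state), the way
 * rand_r() makes the caller own the seed.  *state is a byte offset into
 * the pool; start it at 0.  len must be smaller than POOL_SIZE.
 */
char *generate_r(size_t len, size_t *state)
{
	size_t idx;

	/* wrap around if there is no whole size_t left to consume */
	if (*state + sizeof(idx) > POOL_SIZE)
		*state = 0;

	/* memcpy avoids any alignment assumptions about the pool */
	memcpy(&idx, &pool[*state], sizeof(idx));
	*state += sizeof(idx);

	return &pool[idx % (POOL_SIZE - len)];
}
```

Since each thread keeps its own cursor and nobody writes to the pool after setup, no locking is needed.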

rand_r Considered Harmful

Since we’re on the topic of random number generation, I thought I’d mention what is already a rather widely known fact: libc’s rand and rand_r implementations are terrible. At one point, I tried using them with this approach to generate random buffers, but it didn’t take very long before I got repeats! Caveat emptor.

by JeffPC at August 02, 2014 05:08 PM

July 23, 2014

Josef "Jeff" Sipek

Segment Drivers

Lately, I started poking around the Illumos memory management code. As I’ve done in the past, I decided to use this blahg as a place to document some of my discoveries.

Memory Layout

In Illumos (and Solaris), address spaces are managed as sets of segments. Each segment has a base address, length, and a number of other properties. This is true for both process memory as well as kernel memory. Do not confuse these segments with the memory segmentation that processors like x86 provide.

Each process has its own struct as:

> ::pgrep vim
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  10852  10777  10850  10777    101 0x4a004000 ffffff0411e1c0a0 vim
> ffffff0411e1c0a0::print proc_t p_as | ::print struct as a_segtree
a_segtree = {
    a_segtree.avl_root = 0xffffff03f7c62ea8
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x18
    a_segtree.avl_size = 0x60
}

The kernel address space is maintained in the kas global:

> kas::print a_segtree
a_segtree = {
    a_segtree.avl_root = kvseg+0x20
    a_segtree.avl_compar = as_segcompar
    a_segtree.avl_offset = 0x20
    a_segtree.avl_numnodes = 0x9
    a_segtree.avl_size = 0x60
}

(Once upon a time this set of segments was a linked list, but for a long while now it has been an AVL tree indexed by the base address.)

Regardless of which address space we’re dealing with, the same rules apply: segments represent contiguous regions within the address space. Each segment can represent a different type of memory. For example, walking the kernel address space segment tree yields nine different segments of four different types (kpm, kmem, kp, and map):

> kas::print a_segtree | ::walk avl | ::printf "%p.%016x %a\n" "struct seg" s_base s_size s_ops
fffffe0000000000.000000031e000000 segkpm_ops
ffffff0000000000.0000000017000000 segkmem_ops
ffffff0017000000.0000000080000000 segkp_ops
ffffff0097000000.00000002fca00000 segkmem_ops
ffffff03d3a00000.0000000004000000 segmap_ops
ffffff03d7a00000.000000fbe8600000 segkmem_ops
ffffffffc0000000.000000003b7fb000 segkmem_ops
fffffffffb800000.0000000000550000 segkmem_ops
ffffffffff800000.0000000000400000 segkmem_ops

Segment Drivers

Illumos comes with seven different architecture- and platform-independent segment drivers. A segment driver is a “driver” that implements a couple of functions to manage a segment of memory. That is, each segment type can handle page faults, page locking, sync operations, etc. differently.

For example, suppose that a page fault occurs because a process tried to load a value from a page that lacks a page table entry. The platform specific (assembly) fault handling code gets invoked by the processor. After doing a little bit of work, it calls into the generic (C) fault handling code, as_fault. There, the segtree AVL tree is consulted and the corresponding segment’s fault operation gets invoked.
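
To make the dispatch concrete, here is a toy model of it in plain C. Everything in it is hypothetical: the toy_ names, the single-entry ops vector, and the sorted array standing in for the AVL tree are simplifications for illustration, not the real Illumos structures.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_seg;

/* Per-driver operations vector (far smaller than the real thing). */
struct toy_seg_ops {
	int (*fault)(struct toy_seg *seg, uintptr_t addr);
};

struct toy_seg {
	uintptr_t base;
	size_t size;
	const struct toy_seg_ops *ops;
};

/* Two pretend drivers that just report who handled the fault. */
static int toy_vn_fault(struct toy_seg *seg, uintptr_t addr)
{
	(void) seg; (void) addr;
	return 1;
}

static int toy_dev_fault(struct toy_seg *seg, uintptr_t addr)
{
	(void) seg; (void) addr;
	return 2;
}

static const struct toy_seg_ops toy_vn_ops = { toy_vn_fault };
static const struct toy_seg_ops toy_dev_ops = { toy_dev_fault };

/* Find the segment containing addr; segs must be sorted by base. */
struct toy_seg *toy_seg_lookup(struct toy_seg *segs, size_t nsegs,
    uintptr_t addr)
{
	size_t lo = 0, hi = nsegs;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < segs[mid].base)
			hi = mid;
		else if (addr - segs[mid].base >= segs[mid].size)
			lo = mid + 1;
		else
			return &segs[mid];
	}

	return NULL;
}

/* Generic fault entry point: find the segment, let its driver handle it. */
int toy_as_fault(struct toy_seg *segs, size_t nsegs, uintptr_t addr)
{
	struct toy_seg *seg = toy_seg_lookup(segs, nsegs, addr);

	if (seg == NULL)
		return -1;	/* no segment covers addr: a bad access */

	return seg->ops->fault(seg, addr);
}
```

The point of the indirection is that the generic code never needs to know what kind of memory it is dealing with; it only needs to find the right segment.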

(Solaris Internals lists 12 and 11 segment drivers, respectively, in the two editions.) In Illumos, the seven common segment drivers are:

seg_dev
Most of the time, userspace processes do not need to map devices into their address space. In the rare case when a process does want a device mapped (e.g., Xorg), the dev segment driver maintains that mapping.
seg_kmem
This segment driver maps the kernel heap, module text, and all early boot memory.
seg_kp
In general, kernel memory is not pageable. In the rare case that something can be in kernel pageable memory, this segment is what maintains the anonymous page mappings.
seg_kpm
If possible (you’re on a 64-bit system), the kpm segment driver maps all physical memory into the kernel’s address space. This allows the kernel to not have to set up temporary mappings to operate on physical memory.
seg_map
The map segment driver is a kernel-only higher performance version of the vn segment driver. (See below.)
seg_spt
This segment driver is responsible for maintaining SysV shared memory segments. (Not to be confused with POSIX shared memory.)
seg_vn
Memory mapped files are handled by the vn segment driver. This includes both regular files as well as anonymous memory.

There are also two platform specific segment drivers:

seg_mf (i86xpv only)
This segment driver is only used by dom0 processes (read: Xen) to map pages from other domains.
seg_nf (sparc v9 only)
The header for the file says that it is for non-faulting loads. I don’t actually know what exactly it is for. (And I don’t care enough to dig deeper given that it is Sparc specific.)

The Reality

This is a lot of different segment drivers. Are all of them used all the time? Well, sort of. The mdb output earlier shows that the (amd64) kernel uses only four different segment drivers (kpm, kmem, kp, and map). A typical userspace process is very boring — it is only made up of vn segments. There are, however, exceptions. For instance, Xorg uses vn and dev. This accounts for six of the seven drivers. The last common segment driver is spt, which provides System V shared memory. (I talked about SysV shared memory previously.) So, on a 64-bit x86 system, all seven common segment drivers are in use.

The story is a bit different on 32-bit kernels. Since a 32-bit system has much smaller address space, the kernel tries to eliminate a number of mappings. Here is the list of segments in a 32-bit kernel:

> kas::print a_segtree | ::walk avl | ::printf "%p %a\n" "struct seg" s_base s_ops
b5802000 segmap_ops
b6800000 segkmem_ops
ef400000 segkmem_ops
fe800000 segkmem_ops
ff000000 segkmem_ops

As you can see, the kp and kpm segments went away. While at first this is surprising, it actually makes perfect sense. When thinking about memory there are two “types” to consider: physical and virtual. In theory, one can have more virtual than physical thanks to the MMU, but in reality this is only true on 64-bit systems. Physical memory sizes outgrew 4 GB a number of years ago and therefore a 32-bit address space can trivially be 100% backed by physical memory. In other words, 32-bit address spaces are tight on virtual memory, while 64-bit address spaces are “tight” on physical memory.

Let’s consider the disappearance of the kp segment on 32-bits. What does kp let us do? It lets us oversubscribe physical memory by backing some virtual memory with disk space. On 32-bit systems we have enough physical memory to back all the virtual memory in the kernel so we don’t need to back some of it by disk. So we have no use for it. (Yes, the kernel still could have paged parts of itself out, but kernel text and data is generally considered important enough to keep it in non-pageable memory. The memory utilization will more than pay for itself by the performance improvement of not having the kernel paged out.)

As I stated before, kpm segments map physical memory into the kernel’s address space for performance reasons (without it the kernel would have to temporarily map a page to access the contents). Therefore, they are good candidates for removal when it comes to slimming down the kernel’s address space demands. (Well, the actual story is the other way… the introduction of 64-bit capable hardware allowed kpm segments to exist to improve kernel performance.)

by JeffPC at July 23, 2014 03:11 PM

July 14, 2014

Josef "Jeff" Sipek

Unix Shared Memory

While investigating whether some memory management code was still in use (I’ll blahg about this in the future), I ended up learning quite a bit about shared memory on Unix systems. Since I managed to run into a couple of non-obvious snags while trying to get a simple test program running, I thought I’d share my findings here for my future self.

All in all, there are three ways to share memory between processes on a modern Unix system.

System V shm

This is the oldest of the three. First you call shmget to set up a shared memory segment and then you call shmat to map it into your address space. Here’s a quick example that does not do any error checking or cleanup:

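#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

void sysv_shm()
{
        int ret;
        void *ptr;

        ret = shmget(0x1234, 4096, IPC_CREAT);
        printf("shmget returned %d (%d: %s)\n", ret, errno,
               strerror(errno));

        ptr = shmat(ret, NULL, SHM_PAGEABLE | SHM_RND);
        printf("shmat returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}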

What’s so tricky about this? Well, by default Illumos’s shmat will return EPERM unless you are root. This sort of makes sense given how this flavor of shared memory is implemented. (Hint: it’s all in the kernel.)

POSIX shm

As is frequently the case, POSIX came up with a different interface and different semantics for shared memory. Here’s the POSIX shm version of the above function:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

void posix_shm()
{
	int fd;
	void *ptr;

	fd = shm_open("/blah", O_RDWR | O_CREAT, 0666);
	printf("shm_open returned %d (%d: %s)\n", fd, errno,
	       strerror(errno));

	ftruncate(fd, 4096); /* IMPORTANT! */

	ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}

The very important part here is the ftruncate call. Without it, shm_open may create an empty file and mmapping an empty file won’t work very well. (Well, on Illumos mmap succeeds, but you effectively have a 0-length mapping so any loads or stores will result in a SIGBUS. I haven’t tried other OSes.)

Aside from the funny looking path (it must start with a slash, but cannot contain any other slashes), shm_open looks remarkably like the open system call. It turns out that at least on Illumos, shm_open is implemented entirely in libc. The implementation creates a file in /tmp based on the path provided and the file descriptor that it returns is actually a file descriptor for this file in /tmp. For example, “/blah” input translates into “/tmp/.SHMDblah”. (There is a second file “/tmp/.SHMLblah” that doesn’t live very long. I think it is a lock file.) The subsequent mmap call doesn’t have any idea that this file is special in any way.

Does this mean that you can reach around shm_open and manipulate the object directly? Not exactly. POSIX states: “It is unspecified whether the name appears in the file system and is visible to other functions that take pathnames as arguments.”

The big difference between POSIX and SysV shared memory is how you refer to the segment: SysV uses a numeric key, while POSIX uses a path.

mmap

The last way of sharing memory involves no specialized APIs. It’s just plain ol’ mmap on an open file. For completeness, here’s the function:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

void mmap_shm()
{
	int fd;
	void *ptr;

	fd = open("/tmp/blah", O_RDWR | O_CREAT, 0666);
	printf("open returned %d (%d: %s)\n", fd, errno, strerror(errno));

	ftruncate(fd, 4096); /* IMPORTANT! */

	ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}

It is very similar to the POSIX shm code example. As before, we need the ftruncate to make the shared file non-empty.
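
None of the snippets so far actually share anything with another process. Here is a small sketch that does, assuming a POSIX system: it maps a file MAP_SHARED, forks, lets the child store a value, and has the parent read it back. The function name share_via_mmap and the caller-supplied scratch path are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map path MAP_SHARED, fork, have the child store value into the
 * mapping, and return what the parent reads afterwards.  Returns -1 on
 * any failure.
 */
int share_via_mmap(const char *path, int value)
{
	volatile int *ptr;
	void *map;
	pid_t pid;
	int ret;
	int fd;

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	/* make the file non-empty before mapping it */
	if (ftruncate(fd, 4096) < 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after the close */
	if (map == MAP_FAILED)
		return -1;

	ptr = map;
	*ptr = 0;

	pid = fork();
	if (pid < 0) {
		munmap(map, 4096);
		return -1;
	}

	if (pid == 0) {
		*ptr = value;	/* child: store into the shared page */
		_exit(0);
	}

	waitpid(pid, NULL, 0);	/* parent: wait for the child's store */
	ret = *ptr;

	munmap(map, 4096);
	unlink(path);

	return ret;
}
```

With MAP_PRIVATE instead of MAP_SHARED, the parent would still see 0 because the child’s store would go to its own copy of the page.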


In case you’ve wondered what SysV or POSIX shm segments look like on Illumos, here’s the pmap output for a process that basically runs the first two examples above.

6343:	./a.out
0000000000400000          8K r-x--  /storage/home/jeffpc/src/shm/a.out
0000000000411000          4K rw---  /storage/home/jeffpc/src/shm/a.out
0000000000412000         16K rw---    [ heap ]
FFFFFD7FFF160000          4K rwxs-    [ dism shmid=0x13 ]
FFFFFD7FFF170000          4K rw-s-  /tmp/.SHMDblah
FFFFFD7FFF180000         24K rwx--    [ anon ]
FFFFFD7FFF190000          4K rwx--    [ anon ]
FFFFFD7FFF1A0000       1596K r-x--  /lib/amd64/
FFFFFD7FFF33F000         52K rw---  /lib/amd64/
FFFFFD7FFF34C000          8K rw---  /lib/amd64/
FFFFFD7FFF350000          4K rwx--    [ anon ]
FFFFFD7FFF360000          4K rwx--    [ anon ]
FFFFFD7FFF370000          4K rw---    [ anon ]
FFFFFD7FFF380000          4K rw---    [ anon ]
FFFFFD7FFF390000          4K rwx--    [ anon ]
FFFFFD7FFF393000        348K r-x--  /lib/amd64/
FFFFFD7FFF3FA000         12K rwx--  /lib/amd64/
FFFFFD7FFF3FD000          8K rwx--  /lib/amd64/
FFFFFD7FFFDFD000         12K rw---    [ stack ]
         total         2120K

You can see that the POSIX shm file got mapped in the standard way (address FFFFFD7FFF170000). The SysV shm segment is special — it is not a plain old memory map (address FFFFFD7FFF160000).

That’s it for today. I’m going to talk about segment types in a different post in the near future.

by JeffPC at July 14, 2014 04:22 PM

June 06, 2014

Josef "Jeff" Sipek

Moving and Downtime

I’ll be moving my server over the next couple of days. I’m working on an email setup to make sure there’s no interruption there. The website and the blahg will however be down until Wednesday evening. Sorry for any inconvenience this may cause.

by JeffPC at June 06, 2014 02:45 PM

May 24, 2014

Nate Berry

Upgrading the Macbook to Ubuntu 14.04

While I certainly have been enjoying running Arch on USB on the Dell, my main machine is still the old MacBook running Ubuntu. I had some extra time today and thought “hey, I should install Steam and grab FTL and Kerbal Space Program” but then promptly decided I’d rather do upgrades? I’m running Gnome3 not […]

by Nate at May 24, 2014 05:56 PM

May 12, 2014

Justin Lintz

Ode to Flickr

Over the last 9 years, I’ve had a few things in my life that have benefited me both professionally and personally that have all tied back to Flickr.

In 2003, Canon released the first sub-$1000 digital SLR camera, the Canon EOS 300D, aka the EOS Digital Rebel. The 300D paved the way for entry-level consumers to get involved with the world of digital SLRs; previously, the cheapest option was the Canon 10D, available at $2,000. You could now learn photography with full control over your camera and have instant feedback without burning through rolls of film.

I joined Flickr in June of 2005. I had been checking the site every day for months prior, viewing the explore page and wondering how people were taking such incredible photos. How were people getting that “blur” in the background of their photos, and why couldn’t I get that effect with the Nikon Coolpix 2500 camera I had at the time? Researching what made the backgrounds “blurry” (which I later learned is called bokeh, and is related to aperture size), I was introduced to the world of DSLRs.

Each day I would check Flickr, and I was realizing my current camera was not going to cut it for taking the types of photos I wanted. Not being able to control the ISO, shutter speed, or aperture on my camera was holding me back from learning more about photography. I began researching cameras like crazy on dpreview, reading about Nikon vs Canon. Canon and Nikon had both just released followups to their first DSLRs: the 350D and the D70s. I decided I really wanted the 350D and began saving all the money I could. Combined with some money I got for my 21st birthday, it was enough to finally purchase my first “real” camera in June of 2005.

I joined the original “Delete Me” group and took part in getting my photos torn apart with no filter on the critiques. I loved all the sub-communities Flickr had created from the groups and would spend hours reading discussions and browsing photos in them. I learned about famous photographers and began to appreciate what made their work special. I fell in love with street photography even though to this day I still can’t get comfortable taking a stranger’s photo.

After a year of uploading photos I was fortunate enough that people started reaching out to me for permission to use some of my photos. Small things at first: a Christmas music album cover and marketing materials. My first “big” break came in February 2010, when CBS Sunday Morning news contacted me to use a photo I took of Joe Ades, a street salesman who sold peelers in Union Square. He had passed away and they were doing a feature on him. I granted them the rights to use my photo in the story and it aired on CBS.

A year later, the High Line park in NYC contacted me to ask for my photo to be used for their annual fundraising gala. It was used as the background for the official invitation and was blown up to cover the walls behind the bars at the event. They granted me two tickets in exchange for the photo. It was an amazing feeling to see a photo I took on display like that. A year after that, the High Line contacted me again to use my photo in a book that was being written about the High Line. During this time, they were also separately holding a contest for a photo to be chosen for sale in their store, with proceeds of the sale going towards the maintenance of the park. That same photo eventually won the contest and is still available for sale on the High Line today (although my mom and girlfriend just bought up all the copies at the park store during our Mother’s Day visit).

After graduating college in 2006 I briefly toyed with the idea of moving out to San Francisco to try and get a job working at Flickr, possibly doing web development. I wasn’t really sure what direction I wanted to take my Computer Science degree. Instead I stayed in NY, and started working as a Systems Administrator. In 2009, I came across a talk from John Allspaw and Paul Hammond, who were then the heads of the Operations and Engineering groups at Flickr. The talk was at O’Reilly’s Velocity conference and was about how Flickr’s operations and engineering teams worked together. I had no idea a community of operations folks existed, nor did I know there were even conferences dedicated to what I did for a living. Watching that talk completely changed my outlook on my job as a systems administrator. I realized there was so much more I could be doing to make myself better at my job and to help those around me at work. That talk was my first view into the web operations community and from there I started reading about what other people were doing in my field and how they were solving problems. Several years later I got to attend my first Velocity conference out in Santa Clara and it was awesome. So thank you John and Paul for doing your talk. In 2012, Paul did another great talk on “Infrastructure for Startups”, which again completely resonated with me.

A couple years after John and Paul gave their talk at Velocity, John moved out to NY to take a job at Etsy as the VP of Operations.

While working at Bitly a couple of years ago, we hired Matthew Rothenberg, aka “mroth”, who was previously the head of product at Flickr. I think he spent the first couple of weeks at Bitly just answering my fan-boy questions about Flickr. He even got my account hooked up with a beta feature at the time. Mroth introduced me to John, who was nice enough to grab lunch with me at Etsy and give me some great career advice along with Mike Rembetsy (one of the people responsible for hiring me at my first job out of college).

Flickr gave me a hobby I might never have enjoyed as much as I do now, and provided a platform that let others enjoy my work. The lessons shared by the people working at Flickr made me better at my job and introduced me to a community I didn’t know existed at the time.

by justin at May 12, 2014 05:47 PM

May 01, 2014

Josef "Jeff" Sipek

Task Spooler

For a couple of years now, I have wished that I could have a mini-batch system on my computers that’d let me submit jobs and have them execute when the resources became available. This would let me queue up a large amount of work and it’d eventually all get processed. I even tried to hack up a dumb little Python script that’d loop over a file, executing no more than one job per core.

Then, yesterday, I stumbled across Task Spooler. It’s exactly what I was looking for! It lets me queue jobs, supports dependencies between jobs, etc.

I’m hoping to experiment with it in the next couple of days. I’ll let you know how it turns out.

by JeffPC at May 01, 2014 10:27 PM

Bugs in Time

Recently, I blahgd about GCC optimizing code interestingly. There, I mentioned a couple of bugs I’ve stumbled across. I’m going to talk more about them in this post.


It all started when I got assigned a bug at work. “The installer hangs while checking available disks.” That’s the extent of the information I was given along with a test system. It didn’t take long to figure that devfsadm -c disk was waiting on a kernel thread that didn’t seem to be making any progress:


The function of interest here is ibdm_ibnex_port_settle, but before I talk about it I need to mention that the ibdm kmod stashes a ddi_get_time timestamp of when the HCA attached. Now, ibdm_ibnex_port_settle calls ibdm_get_waittime to get a delay to feed to cv_reltimedwait. The delay is (more or less) calculated as: ddi_get_time() - hca_attach_time. This works fine as long as ddi_get_time continues incrementing at a constant rate (1 sec/sec).

You may already see where this is going. The problem is that ddi_get_time returns a Unix timestamp based on the current time-of-day clock. If the TOD setting changes for whatever reason (daylight saving time adjustments, NTP, etc.), the value returned by ddi_get_time may change non-monotonically. This makes it unsuitable for calculating timeouts and wait times. Converting ibdm_get_waittime to use a monotonic clock source (like gethrtime or ddi_get_lbolt) fixes this bug. (Illumos bug 4777)
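
The same rule applies in userspace, which makes it easy to illustrate without a kernel. The following is a sketch only: it uses POSIX clock_gettime with CLOCK_MONOTONIC as a stand-in for the kernel’s gethrtime, and the helper names mono_now_ns and remaining_ns are made up. It is not the actual ibdm fix.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC	1000000000LL

/* Current monotonic time in nanoseconds; immune to time-of-day changes. */
int64_t mono_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);

	return (int64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/*
 * How much longer to wait, given when the wait started and how long
 * the whole wait should be.  Never negative.
 */
int64_t remaining_ns(int64_t start_ns, int64_t timeout_ns, int64_t now_ns)
{
	int64_t elapsed = now_ns - start_ns;

	if (elapsed >= timeout_ns)
		return 0;

	return timeout_ns - elapsed;
}
```

Because the monotonic clock never jumps backwards, elapsed can never go negative, so the computed wait can never exceed the intended timeout no matter what happens to the wall clock.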

Things get a bit worse. While figuring out what ddi_get_time does, I noticed that the man page actively encouraged developers to use it for timeouts. (Illumos bug 4776)

Of course, once I knew about this potential abuse, I had to check that there weren’t similar issues elsewhere in the kernel… and so I got to file bugs for iprb (4778), vhci (4779), COMSTAR iSCSI target (4780), sd (4781), usba (4782), emlxs (4786), ipf (4787), mac (4788), amr (4789), arcmsr (4790), aac (4791), and heci (4792).

I’m fixing all except: amr, arcmsr, aac, and heci.


While developing the series of fixes mentioned in the previous section, I ran into the fact that NANOSEC was defined as 1000000000. This made it an int — a 32-bit signed integer (on both ILP32 and LP64).

If NANOSEC (defined this way) is used to convert seconds to nanoseconds (by multiplying), the naive approach will fail with quantities larger than 2 seconds. For example (hrtime_t is a 64-bit signed int):

hrtime_t convert(int secs)
{
        return (secs * NANOSEC);
}

Since both secs and NANOSEC are 32-bit integers, the compiler will compute the product as a 32-bit value (overflowing for anything larger than 2 seconds) and only then sign extend the result to 64 bits. If you look around the Illumos codebase, you’ll see plenty of places that cast or use a ULL or LL suffix to make the compiler do the right thing. Why not just change the definition of NANOSEC to include an LL suffix, relieving the users of this tedious (and error prone!) duty? Well, now you know what Illumos bug 4809 is about. :)
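
Here is a small userspace sketch of the difference. The macro names NANOSEC_INT and NANOSEC_LL and the helper fits_in_int32 are made up so that both definitions can sit side by side; int64_t stands in for hrtime_t. To avoid actually triggering the (undefined) signed overflow, the sketch computes the product in 64 bits and checks whether it would have fit in 32.

```c
#include <stdint.h>

#define NANOSEC_INT	1000000000	/* old definition: type int */
#define NANOSEC_LL	1000000000LL	/* new definition: type long long */

/* Would secs * NANOSEC fit in a 32-bit signed int? */
int fits_in_int32(int secs)
{
	int64_t product = (int64_t)secs * NANOSEC_INT;

	return product >= INT32_MIN && product <= INT32_MAX;
}

/* The fixed conversion: the LL suffix forces a 64-bit multiply. */
int64_t convert(int secs)
{
	return secs * NANOSEC_LL;
}
```

fits_in_int32 is true for 2 seconds and false for 3, which is exactly the “larger than 2 seconds” limit mentioned above.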

So, I changed the definition and rebuilt everything. Then, using wsdiff (think: recursive diff that understands how to compare ELF files) I found two places where the before and after binaries differed for non-trivial reasons. (I define a trivial reason as “the compiler decided to use registers differently, but the result is the same”.) Each non-trivial difference implies that there was an expression that changed — it used to be busted!

The first difference was in ZFS (Illumos bug 4810). There, spa_async_tasks_pending miscalculated a timeout making the condition always true.

The second difference was in in.mpathd (Illumos bug 4811). This daemon has a utility function to convert a struct timeval into a hrtime_t. You can read more about it in my previous post.

Before the NANOSEC change, I would have needed casts to fix this. With the change in definition, I don’t have to change a thing! And that’s how a one liner closed three bugs at the same time:

commit b59e2127f21675e88c58a4dd924bc55eeb83c7a6
Author: Josef 'Jeff' Sipek <>
Date:   Mon Apr 28 15:53:04 2014 -0400

    4809 NANOSEC should be 'long long' to avoid integer overflow bugs
    4810 spa_async_tasks_pending suffers from an integer overflow bug
    4811 in.mpathd: tv2ns suffers from an integer overflow bug
    Reviewed by: Marcel Telka <>
    Reviewed by: Dan McDonald <>
    Approved by: Robert Mustacchi <>

by JeffPC at May 01, 2014 09:51 PM

April 25, 2014

Josef "Jeff" Sipek

GCC Optimizations

Recently, I’ve been given a hang bug to work on. This led me to another bug related to timing, which pushed me to clean up a time related #define, which uncovered at least two bugs. Got all that? Good. The rest of this post is going to talk about the changed define, and one of the “at least two bugs”. When I talk about GCC, I’m talking about the Illumos-specific GCC version 4.4.4. (Illumos needs a couple of features that stock GCC doesn’t provide.)

The #define change I’m hoping to make is very simple:

diff --git a/usr/src/uts/common/sys/time.h b/usr/src/uts/common/sys/time.h
--- a/usr/src/uts/common/sys/time.h
+++ b/usr/src/uts/common/sys/time.h
@@ -234,7 +234,7 @@ struct itimerval32 {
 #define        SEC             1
 #define        MILLISEC        1000
 #define        MICROSEC        1000000
-#define        NANOSEC         1000000000
+#define        NANOSEC         1000000000ll
 #define        MSEC2NSEC(m)    ((hrtime_t)(m) * (NANOSEC / MILLISEC))
 #define        NSEC2MSEC(n)    ((n) / (NANOSEC / MILLISEC))

Without it, multiplying by NANOSEC will cause integer overflow issues on ILP32 and LP64 systems (read: basically everywhere).

One of the “at least two bugs“ involves a simple (buggy) function aptly named tv2ns as it converts a struct timeval to a 64-bit nanosecond count:

static int64_t
tv2ns(struct timeval *tvp)
{
	return (tvp->tv_sec * NANOSEC + tvp->tv_usec * 1000);
}

At first glance, this function looks correct. The only flaw with it is that the first portion of the expression multiplies a time_t (32-bit signed int) with an int (also 32-bit signed), making the result of that subexpression a 32-bit signed value. With NANOSEC changed to a long long, everything works as expected. Now, the fun part… disassembling this function without the fix. You don’t have to be an expert to see that this function is strangely repetitive. I’ve annotated the assembly.

tv2ns:          movl   0x4(%esp),%eax     ; eax = tvp
tv2ns+4:        movl   0x4(%eax),%edx     ; edx = tvp->tv_usec
tv2ns+7:        leal   (%edx,%edx,4),%edx ; edx = edx + 4 * edx
tv2ns+0xa:      leal   (%edx,%edx,4),%edx ;     = 5 * edx
tv2ns+0xd:      leal   (%edx,%edx,4),%edx
; at this point:  edx = 5 * 5 * 5 * tvp->tv_usec,
; which is the same as: 125 * tvp->tv_usec
tv2ns+0x10:     movl   (%eax),%eax        ; eax = tvp->tv_sec
tv2ns+0x12:     leal   (%eax,%eax,4),%eax ; eax = eax + 4 * eax
tv2ns+0x15:     leal   (%eax,%eax,4),%eax ;     = 5 * eax
tv2ns+0x18:     leal   (%eax,%eax,4),%eax
tv2ns+0x1b:     leal   (%eax,%eax,4),%eax
tv2ns+0x1e:     leal   (%eax,%eax,4),%eax
tv2ns+0x21:     leal   (%eax,%eax,4),%eax
tv2ns+0x24:     leal   (%eax,%eax,4),%eax
tv2ns+0x27:     leal   (%eax,%eax,4),%eax
tv2ns+0x2a:     leal   (%eax,%eax,4),%eax
; at this point,  eax = 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * tvp->tv_sec,
; which is the same as: 1953125 * tvp->tv_sec
tv2ns+0x2d:     shll   $0x9,%eax          ; eax <<= 9
; eax = (1953125 * tvp->tv_sec) << 9,
; which surprisingly ends up being the same as: 1000000000 * tvp->tv_sec
; so, now we have 'eax' with the tv_sec converted to nanoseconds and 'edx'
; with 125 * tv_usec
tv2ns+0x30:     leal   (%eax,%edx,8),%eax ; eax = eax + 8 * edx
; 8 * 125 = 1000, which is the factor to convert tv_usec to nanoseconds!
tv2ns+0x33:     cltd                      ; sign-extend eax to edx:eax
tv2ns+0x34:     ret    

I found it interesting that GCC decided to emit leal instructions to multiply by 5 and then finish it off with a shift and another leal. This is another one of those times when I realize that the compiler is smarter than me. (The sign-extension of course happens too late — all the math needs to happen as 64-bit arithmetic, but that’s not GCC’s fault.)
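For anyone who wants to double check the arithmetic in the annotated listing, here is a small sketch that replays the leal chains in C. The function names are mine, and unsigned 32-bit integers stand in for the registers so that wraparound is well defined.

```c
#include <stdint.h>

/* One "leal (%r,%r,4),%r": r = r + 4 * r, i.e. a multiply by 5. */
static uint32_t
lea5(uint32_t r)
{
	return (r + 4 * r);
}

/* tv_sec path: nine multiplies by 5, then a left shift by 9. */
uint32_t
mul_nanosec_32(uint32_t secs)
{
	uint32_t eax = secs;
	int i;

	for (i = 0; i < 9; i++)
		eax = lea5(eax);	/* eax = 5^9 * secs = 1953125 * secs */

	return (eax << 9);		/* 1953125 * 512 = 1000000000 */
}

/* tv_usec path: three multiplies by 5, then the scale-by-8 of the last leal. */
uint32_t
mul_1000_32(uint32_t usec)
{
	uint32_t edx = usec;
	int i;

	for (i = 0; i < 3; i++)
		edx = lea5(edx);	/* edx = 125 * usec */

	return (edx * 8);		/* 125 * 8 = 1000 */
}
```

mul_nanosec_32(1) is 1000000000 as advertised, and mul_nanosec_32(5) wraps around to 705032704, which is the bug in a nutshell.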

For the record, with the #define changed, the function looks like the following — sorry, no comments on this one:

tv2ns:          pushl  %edi
tv2ns+1:        pushl  %esi
tv2ns+2:        pushl  %ebx
tv2ns+3:        subl   $0x8,%esp
tv2ns+6:        movl   0x18(%esp),%ecx
tv2ns+0xa:      movl   0x4(%ecx),%eax
tv2ns+0xd:      leal   (%eax,%eax,4),%eax
tv2ns+0x10:     leal   (%eax,%eax,4),%eax
tv2ns+0x13:     leal   (%eax,%eax,4),%ebx
tv2ns+0x16:     shll   $0x3,%ebx
tv2ns+0x19:     movl   %ebx,%esi
tv2ns+0x1b:     sarl   $0x1f,%esi
tv2ns+0x1e:     movl   $0x3b9aca00,%edi
tv2ns+0x23:     movl   (%ecx),%eax
tv2ns+0x25:     imull  %edi
tv2ns+0x27:     movl   %eax,(%esp)
tv2ns+0x2a:     movl   %edx,0x4(%esp)
tv2ns+0x2e:     addl   %ebx,%eax
tv2ns+0x30:     adcl   %esi,%edx
tv2ns+0x32:     addl   $0x8,%esp
tv2ns+0x35:     popl   %ebx
tv2ns+0x36:     popl   %esi
tv2ns+0x37:     popl   %edi
tv2ns+0x38:     ret    

Maybe one day I’ll rummage through my brain and dig up other times that GCC has outsmarted me and blahg about them. :)

by JeffPC at April 25, 2014 10:49 PM

April 10, 2014

Justin Dearing

The case for open sourcing the SQL Saturday Website

My name is Justin Dearing. I write software for a living. I also write software for free as a hobby and for personal development. When I’m not writing code, I speak at user groups, events and conferences about code and code related topics. One such event is SQL Saturday. I haven’t spoken in a while because I became a dad in June. However, my daughter is 9 months old now and the weather is warm. I feel comfortable attending a regional SQL Saturday or two.

So last night I submitted to SQL Saturday Philadelphia. The submission process (I mean the mechanical process of using the website to submit my abstract) was annoying, as usual. What really got me going though was when I realized two things:

  • My newlines were not being preserved so that my asterisks that were supposed to punctuate bullet points were not at the beginnings of lines.
  • I could not edit my submission once submitted.

I like bullet points, a lot. However, I digress. In response to my anger, I complained on twitter that the site should be open sourced, so I the end user could create a better experience for myself and my fellow SQL Saturday Speakers.

I got three retweets. At least I wasn’t completely alone in my sentiment. I complained again in the morning, started a conversation and eventually Tim sent this out:

So the site was being rewritten, but it would not be open sourced.

Should I have been happy at that point, or at least patiently await the changes? One could presume that session editing and submission would be improved. At the very least, things would get progressively better as there were revisions to the code. If the federal government could pull off the ObamaCare site, with some hiccups, why can’t a group of DBAs launch a much smaller website, with much simpler requirements and lower load?

I’d be willing to bet they will. I’d be willing to bet that this site will suck a lot less than the old site, and that it will continue to progress. I’m sure smart people are working on it, and a passionate BoD are guiding the process. At the very least I’ll withhold judgement until the new site is live.

Despite my confidence in the skills of the unknown (to me) parties working on the site, there are so many hours in the day and only so many things a team of finite size can do. However, a sizable minority of PASS’s membership are .NET developers. Many of them speak at SQL Saturdays. They have to submit to the site. Some of them will no doubt be annoyed at some aspect of the site. Some of them might fix that annoyance, or scratch their itch in OSS parlance, if the site was open source and there was a process to accept pull requests.

I’m not describing a hypothetical nirvana. I’ve seen the process I describe work because I’ve submitted a lot of patches to a lot of OSS projects. I’ve submitted a patch to the (not actually open source, as Brent will be the first to state) sp_blitz and Brent accepted it. I’ve contributed to NancyFX. I once contributed a small patch to PHP to make it consume WCF services better. I’ve contributed to several other OSS projects as well.

Perhaps you’re saying SQL Server is a Microsoft product, not some hippie Linux thing. Perhaps you share the same sentiment as Noel McKinney:

However, as I pointed out to Noel, the mothership’s (i.e., Microsoft’s; editor’s note: Noel has stated to me he meant Microsoft) beliefs are not anti-OSS. Microsoft has fully embraced open source. You can become an MVP purely for OSS without any speaking or forum contributions. One of the authors of NancyFX is an example of such a recipient. F#, ASP.NET and Entity Framework are all open source. Just this week Microsoft open sourced Roslyn. As a matter of fact, I’ve even submitted a patch to the nuget gallery website, which is operated by Microsoft and owned by the OuterCurve foundation. The patch was accepted and my code, along with the code of others, was pushed to So I’ve already submitted source code for a website owned and operated by an independent organization set up by Microsoft, they’ve already accepted it, and the world seems a slightly better place as a result.

So I ask the PASS BoD to consider releasing the SQL Saturday Website source code on github, and I ask the members of PASS to ask their BoD to release the source code as well.

by Justin at April 10, 2014 03:04 AM

April 07, 2014

Josef "Jeff" Sipek

Happy 50th, System/360

It’s been a while since I blahged about mainframes. Rest assured, I’m still a huge fan, I’m just too preoccupied with other things to continuously extol their virtues.

The reason I’m writing today is because it is the 50th anniversary of the System/360 announcement. Aside from the “50 years already?” sentiment, I have a couple of images to share. (I found these several years ago on someone’s GeoCities site. It’s a good thing I made a mirror :) )

I also came across this video from 1964:

by JeffPC at April 07, 2014 03:44 PM

March 24, 2014

Josef "Jeff" Sipek

Netflix Chaos Monkey

Somehow, I managed to miss that about two years ago Netflix open sourced their chaos monkey.

Based on my quick look over the code, it appears to be written in Java. Meh. Regardless of the language, it’s great to see large companies open source their code.

by JeffPC at March 24, 2014 07:18 PM

March 02, 2014

Josef "Jeff" Sipek

Comment Spam Filtering Experiments

Just a heads up, I’m getting fed up with all the comment spam that ends up on the moderation queue. So, I’m working on some code to reject comment spam before it gets there. As the title for this post implies, these are experiments; I’ll try my best not to reject any valid comments. I apologize if a valid comment does get rejected.

If you end up being a victim of my overzealous filters, please email me:

by JeffPC at March 02, 2014 03:33 PM

February 22, 2014

Josef "Jeff" Sipek

Greetings from Nexenta

In case you missed it, back in mid-2011 I discovered Illumos and OpenIndiana. At that point, I already missed hacking on the (Linux) kernel. Based on my blahg posts [1,2], it shouldn’t surprise you that it didn’t take long before I wanted to hack on the Illumos kernel…and so I did.

If you ever contributed to an open source project in your free time while employed full-time, you understand that there’s only so much time you can devote to the open source project and therefore there is only so much you can do.

A couple of months ago, I decided to explore the possibility of working full-time on Illumos. There are only a handful of companies that visibly participate in the Illumos ecosystem, but their use of Illumos is pretty varied (from public clouds to virtualized databases to SAN/NAS appliances). As of this past Tuesday (Monday was a holiday), I’m at Nexenta. At least for now, I’m working remotely (from Ann Arbor) with the fine folks in the Wikipedia article: Lowell office. It feels great to work on open source again.

by JeffPC at February 22, 2014 06:11 PM

February 16, 2014

Nate Berry

Arch Linux on bootable, persistent USB drive

I recently got a new laptop from work. It’s a refurbished Dell Latitude E6330 with an Intel Core i5 processor, a 13″ screen and a 120GB SSD drive that came with Windows 7 Pro. I haven’t used Windows regularly in quite some time (I’ve been using a WinXP VM on the rare occasion I need […]

by Nate at February 16, 2014 08:07 PM

Justin Dearing

Creating a minimally viable CentOS OpenLogic rapache instance

Recently I’ve been dealing with R and rapache at work. R is a language for statisticians. rapache is an apache module for executing R scripts in apache. It’s like mod_perl or mod_php for R. I’ve been writing simple RESTful scripts that return graphics and JSON, and calling them from static html pages. I’ve also been using my MSDN Azure subscription to engage in R self study at home. In the spirit of my last post, I’ve posted the setup notes here to get you started with a new Azure VM for running an rapache instance. Azure used a special cloud enabled version of CentOS 6.3 called OpenLogic. However, it seems to work similarly to the vanilla CentOS 6.4 instances I’ve used at work. So everything should apply there. If something doesn’t work, leave a comment.

  • First, CentOS is very conservative, but Fedora makes EPEL to give you a more modern set of RPMs
    • rpm -Uvh
  • Now let’s install the packages we need. The kernel will be updated, so we will need to reboot.
    • yum update -y
    • yum install -y vim-x11 vim-enhanced xauth R terminator xterm rxvt R httpd git httpd-devel gcc cairo cairo-devel libXt-devel
    • yum groupinstall -y fonts
    • ldconfig
    • shutdown -r now
  • Now, as a regular user, let’s compile rapache.
    • mkdir ~/src
    • cd ~/src
    • git checkout
    • cd rapache
    • ./configure && make && sudo make install
  • Now let’s configure rapache. Create a file called /etc/httpd/conf.d/rapache.conf with the following:
# rapache configuration by Justin Dearing <>
LoadModule R_module modules/
<Location /RApacheInfo>
 SetHandler r-info
</Location>
AddHandler r-script .R
RHandler sys.source
  • Now restart apache.  Make sure it’s working by running 
    elinks http://localhost/RApacheInfo.

Azure doesn’t configure swap space by default. You’re going to absolutely need some swap space if you’re using an extra small instance. A good howto for that is here.

by Justin at February 16, 2014 05:52 AM

February 15, 2014

Justin Lintz

Pager Huety

For a hack week project at Chartbeat, I hooked my Philips Hue light bulbs into PagerDuty so whenever I get paged my lights will start flashing. Read about the hack over on the PagerDuty blog.

by justin at February 15, 2014 06:52 PM

January 07, 2014

Josef "Jeff" Sipek

Google Traffic

Ever wonder how Google gets its traffic information?


Apparently, there are two sources. The first is the Department of Transportation. The second consists of Android users.

You can always check Google Location History to see what sort of data Google has. (Of course, they may always have more than they show.) Seeing the data can be a bit unnerving. Since I’m not really into giving Google more data than they already have to begin with, and I see no reason for Google to know exactly where I spend my time, I decided to turn this feature off.

Turning it off

You can find the setting by running the “Google Settings” app. That’s right, not “Settings”. Once there, select “Location”.


As you can see, I want to treat Google apps like any other vendor’s apps. As an added bonus, it looks like my GPS is on way less often.

by JeffPC at January 07, 2014 06:22 PM

January 05, 2014

Josef "Jeff" Sipek

x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the Wikipedia article: APIC, xAPIC, and Wikipedia article: x2APIC and how they’re handled in Illumos.


I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles above are a good start.

First things first, we need some definitions. APIC can refer to either the architecture or to a very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that these chips existed. Everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showing up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry the x2APIC is fully backwards compatible with xAPICs. x2APIC includes some goodies such as upping the limit on the number of CPUs to $2^{32}$.

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and I/O APIC. Each processor gets their own local APIC and I/O buses get I/O APICs. I/O APICs can service more than one device, and in fact many systems have only one I/O APIC.

The xAPIC uses Wikipedia article: MMIO to program the local and I/O APICs.

x2APIC has two modes of operation. First, there is the xAPIC compatibility mode which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using Wikipedia article: MSRs.
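As a rough illustration of the MMIO versus MSR difference, here is a sketch of how a local APIC register offset translates between the two schemes. This follows my reading of the x2APIC specification listed at the end of this post (the MSR range starts at 0x800 and each 16-byte MMIO register slot becomes one MSR); the macro and function names are made up, and nothing here is taken from the Illumos source.

```c
#include <stdint.h>

/*
 * In xAPIC mode, registers live in a memory-mapped page (by default at
 * physical address 0xFEE00000), 16 bytes apart.  In x2APIC mode the
 * same registers are reached through MSRs starting at 0x800.
 */
#define	X2APIC_MSR_BASE	0x800u

/* Translate an xAPIC MMIO register offset into the x2APIC MSR number. */
static inline uint32_t
x2apic_msr_for(uint32_t mmio_offset)
{
	return (X2APIC_MSR_BASE + (mmio_offset >> 4));
}
```

For example, the local APIC ID register at MMIO offset 0x20 becomes MSR 0x802, and the EOI register at offset 0xB0 becomes MSR 0x80B. A few registers (the interrupt command register, for one) change shape in x2APIC mode, so this mapping is only the starting point.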

One interesting fact about x2APIC is that it requires an Wikipedia article: iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

Illumos /etc/mach

Illumos has two APIC drivers. First, there is pcplusmp which knows how to handle APIC and xAPIC. Second, there is apix which targets x2APIC, but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
# entry is loaded first and the first entry loaded last).

Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

pcplusmp + apix Code Sharing

The code in these two modules can be summarized with a word: mess. Following what happens when would be enough of an adventure. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straightforward as one would hope.

Directory         | pcplusmp                                              | apix
i86pc/io          | mp_platform_common.c, mp_platform_misc.c, hpet_acpi.c | mp_platform_common.c, hpet_acpi.c
i86pc/io/psm      | psm_common.c                                          | psm_common.c
i86pc/io/pcplusmp | *                                                     | apic_regops.c, apic_common.c, apic_timer.c
i86pc/io/apix     | -                                                     | *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)


When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via Wikipedia article: cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module — even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.

What mode are you using? Easy, just check the apic_mode global in the kernel:

# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k

2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

APIC hw | CPUID | IOMMU   | IOMMU state | Module   | apic_mode
xAPIC   | off   | -       | -           | pcplusmp | LOCAL_APIC
x2APIC  | off   | -       | -           | pcplusmp | LOCAL_APIC
x2APIC  | on    | absent  | -           | apix     | LOCAL_APIC
x2APIC  | on    | present | off         | apix     | LOCAL_APIC
x2APIC  | on    | present | on          | apix     | LOCAL_X2APIC
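The same table can be written down as a tiny decision function. This is only a sketch of the table above, not code from the kernel: the struct and function names are hypothetical, and it ignores the rare case of the BIOS enabling x2APIC mode before the OS takes over.

```c
#include <stdbool.h>

/* Which module ends up driving the APIC, and in which mode. */
struct apic_choice {
	const char *module;	/* "pcplusmp" or "apix" */
	const char *mode;	/* "LOCAL_APIC" or "LOCAL_X2APIC" */
};

struct apic_choice
pick_apic(bool cpuid_x2apic, bool iommu_present, bool iommu_on)
{
	struct apic_choice c = {
		.module = "pcplusmp",
		.mode = "LOCAL_APIC",
	};

	/* No x2APIC support reported by cpuid: apix declines to attach. */
	if (!cpuid_x2apic)
		return (c);

	/* apix takes over, but stays in xAPIC mode without a live iommu. */
	c.module = "apix";
	if (iommu_present && iommu_on)
		c.mode = "LOCAL_X2APIC";

	return (c);
}
```

Only one of the five rows ends in LOCAL_X2APIC, which matches what the table says: both the CPUID bit and an enabled iommu are needed.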


I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

$ cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^\(#.*\|\)$'

In order to get LOCAL_X2APIC mode, you need to set:


Once you put those into the config file, update your boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)


It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with its I/Os, and its interrupt service routine shuts down the driver. (On a debug build it throws in an ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore no use for the port at the moment.

immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup which calls immu_intrmap_setup which sets psm_vt_ops to a non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops is NULL.
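The dependency between the two initialization steps can be modeled in a few lines. To be clear, this is a toy model and not the Illumos code: the model_* functions and the struct are invented, and only the psm_vt_ops name and the order of events come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

struct vt_ops { int placeholder; };	/* hypothetical stand-in type */

static struct vt_ops model_ops;
static struct vt_ops *psm_vt_ops;	/* NULL until the iommu code sets it */

/* Model of immu_init(): bail out early unless immu-enable is true. */
static void
model_immu_init(bool immu_enable)
{
	psm_vt_ops = NULL;
	if (!immu_enable)
		return;

	/* Stands in for immu_subsystems_setup -> immu_intrmap_setup. */
	psm_vt_ops = &model_ops;
}

/* Model of apic_intrmap_init(): does nothing when psm_vt_ops is NULL. */
static bool
model_apic_intrmap_init(void)
{
	return (psm_vt_ops != NULL);
}

/* Boot order: the iommu is initialized first, apix comes later. */
bool
model_boot_uses_intrmap(bool immu_enable)
{
	model_immu_init(immu_enable);
	return (model_apic_intrmap_init());
}
```

The point of the model is just the ordering: by the time apix looks at psm_vt_ops, the immu-enable setting has already decided the outcome.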

The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait related issue since I cannot disable just mwait without disabling deep C states). It does not matter if the apic_mode is LOCAL_APIC or LOCAL_X2APIC.

Assorted Documentation

  1. Intel 64 Architecture x2APIC Specification
  2. Intel MP Spec 1.4

by JeffPC at January 05, 2014 07:38 PM

January 04, 2014

Josef "Jeff" Sipek

Post Preview

One of the blogs I’ve been reading for a few months now just had a post about partial vs. full entries on blog front pages. Since I have some opinions on the subject, I decided to comment. My response turned into something sufficiently content-full that I decided that my blahg would be a better place for it. Sorry, Chris :P

First of all, my blog doesn’t support partial post display because… technical reasons. (The sinking feeling of discovering a design mistake in your code really resonated with me about this exact thing.) With that said, I don’t think that partial display is necessarily bad. I feel like any reasonable (this is of course subjective) blogging software should follow these rules:

  1. if we’re displaying an Atom/RSS feed, display full post
  2. if we’re displaying a single post, display full post
  3. if the post contains a magical marker that denotes where to stop the preview, display everything above the marker
  4. display full post
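The rules above translate almost directly into code. Here is a minimal sketch, assuming the marker is a fixed string; the PREVIEW_MARKER value and the function name are hypothetical, since my blogging software supports none of this yet.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define	PREVIEW_MARKER	"<!--more-->"	/* hypothetical marker string */

/* Returns how many bytes of the post body should be displayed. */
size_t
display_length(const char *post, bool is_feed, bool is_single_post)
{
	const char *marker;

	/* Rules 1 and 2: feeds and single-post pages get the full post. */
	if (is_feed || is_single_post)
		return (strlen(post));

	/* Rule 3: stop at the marker if the author placed one. */
	marker = strstr(post, PREVIEW_MARKER);
	if (marker != NULL)
		return ((size_t)(marker - post));

	/* Rule 4: no marker, so display everything. */
	return (strlen(post));
}
```

The order of the checks matters: the feed and single-post cases come first so that a marker never truncates them.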

I really dislike when the feeds give me the first sentence and I have to click a link to read more. At the very least, it is inconvenient, and in extreme cases it feels outright insulting.

I think the post-by-post-basis Chris suggests is the way to go, but in the absence of a user-defined division point I would display the whole thing.

Do I write many posts where I wish I could use this magical marker? No. If that were the case, I’d make supporting this a higher priority. However, there have been a handful of times where I believe that the rest of the post is uninteresting to…well…just about everyone and it is really long. So long, that you might get bored trying to scroll past it. (If you are reading my blahg, I don’t want you to be bored because you had to scroll for too long to skip over an entry — you are my guest, and I am here to entertain you.) This is the time I believe displaying a partial post is good.

I’m hoping that eventually I’ll wrestle with my blogging software sufficiently to eliminate the technical reasons preventing me from introducing and processing this special marker. Not that you’ll really notice anything different. :)

by JeffPC at January 04, 2014 03:10 PM

January 02, 2014

Josef "Jeff" Sipek

Designated Initializers

Designated initializers are a neat feature in C99 that I’ve used for about 6 years. I can’t fathom why anyone would not use them if C99 is available. (Of course if you have to support pre-C99 compilers, you’re very sad.) In case you’ve never seen them, consider this example that’s perfectly valid C99:

int abc[7] = {
	[1] = 0xabc,
	[2] = 0x12345678,
	[3] = 0x12345678,
	[4] = 0x12345678,
	[5] = 0xdef,
};

As you may have guessed, indices 1–5 will have the specified value. Indices 0 and 6 will be zero. Cool, eh?
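If you want to convince yourself about the zero fill, here is a small self-contained sketch. The struct example is my addition to show that the same syntax works for struct members; the type and accessor names are made up.

```c
#include <stddef.h>

struct point {
	int x;
	int y;
	int z;
};

/* Only indices 1 and 5 are named; everything else is zero. */
static const int abc[7] = {
	[1] = 0xabc,
	[5] = 0xdef,
};

/* Same idea for structs: only y is named, x and z are zero. */
static const struct point p = {
	.y = 2,
};

int
abc_at(size_t i)
{
	return (abc[i]);
}

int
point_sum(void)
{
	return (p.x + p.y + p.z);
}
```

Here abc_at(0) and abc_at(6) are both 0, and point_sum() is 2 because the unnamed members were zeroed.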

GCC Extensions

Today I learned about a neat GNU extension in GCC to designated initializers. Consider this code snippet:

int abc[7] = {
	[1] = 0xabc,
	[2 ... 5] = 0x12345678,
	[5] = 0xdef,
};

Mind blowing, isn’t it?

Beware, however… GCC’s -std=c99 will not error out if you use ranges! You need to throw in -pedantic to get a warning.

$ gcc -c -Wall -std=c99 test.c
$ gcc -c -Wall -pedantic -std=c99 test.c
test.c:2:5: warning: ISO C forbids specifying range of elements to initialize [-pedantic]

by JeffPC at January 02, 2014 02:46 PM

December 30, 2013

Josef "Jeff" Sipek


I briefly mentioned that I was debugging a boot hang. Since the hang does not happen every time I try to boot, it may take a couple of reboots to get the kernel to hang. Doing this manually is tedious. Thankfully it can be scripted. Therefore, I made a simple script and a SMF manifest that runs the script at the end of boot. If the system boots fine, my script reboots it. If the system hangs mid-boot, well my script never executes leaving the system in a hung state. Then, I can break into the kernel debugger (mdb) and investigate.

I’m sharing the two here mostly for my benefit… in case one day in the future I decide that I need my system automatically rebooted over and over again.

The script is pretty simple. Hopefully, 60 seconds is long enough to log in and disable the service if necessary. (In reality, I set up a separate boot environment that’s the default choice in Grub. I can just select my normal boot environment and get back to a non-timebomb system.)


sleep 60

reboot -p

The tricky part is of course in the manifest. Not because it is hard, but because XML is … verbose.

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='rebooter'>
	<service name='site/rebooter' type='service' version='1'>
		<dependency name='booted'

		<property_group name="startd" type="framework">
			<propval name="duration" type="astring" value="child"/>
			<propval name="ignore_error" type="astring"

		<instance name='system' enabled='true'>
				timeout_seconds='0' />

				timeout_seconds='0' />

		<stability value='Unstable' />

That’s all, carry on what you were doing. :)

by JeffPC at December 30, 2013 08:44 PM

December 29, 2013

Josef "Jeff" Sipek

CPU Pause Threads

Recently, I ended up debugging a boot hang. (I’m still working on it, so I don’t have a resolution to it yet.) The hang seems to occur during the mp startup. That is, when the boot CPU tries to online all the other CPUs on the system. As a result, I spent a fair amount of time reading the code and poking around with mdb. Given the effort I put in, I decided to document my understanding of how CPUs get brought online during boot in Illumos. In this post, I’ll talk about the CPU pause threads.

Each CPU has a special thread: the pause thread. It is a very high priority thread that’s supposed to preempt everything on the CPU. If all CPUs are executing this high-priority thread, then we know for a fact that nothing can possibly be dereferencing the CPU structures’ (cpu_t) pointers. Why is this useful? Here’s a comment from right above cpu_pause, the function pause threads execute:

 * This routine is called to place the CPUs in a safe place so that
 * one of them can be taken off line or placed on line.  What we are
 * trying to do here is prevent a thread from traversing the list
 * of active CPUs while we are changing it or from getting placed on
 * the run queue of a CPU that has just gone off line.  We do this by
 * creating a thread with the highest possible prio for each CPU and
 * having it call this routine.  The advantage of this method is that
 * we can eliminate all checks for CPU_ACTIVE in the disp routines.
 * This makes disp faster at the expense of making p_online() slower
 * which is a good trade off.

The pause thread is pointed to by the CPU structure’s cpu_pause_thread member. A new CPU does not have a pause thread until after it has been added to the list of existing CPUs. (cpu_pause_alloc does the actual allocation.)

CPU pausing is pretty strange. First of all, let’s call the CPU requesting other CPUs to pause the controlling CPU and all online CPUs that will pause the pausing CPUs. (The controlling CPU does not pause itself.) Second, there are two global structures: (1) a global array called safe_list which contains an 8-bit integer for each possible CPU where each element holds a value ranging from 0 to 4 (PAUSE_*) denoting the state of that CPU’s pause thread, and (2) cpu_pause_info which contains some additional goodies used for synchronization.


To pause CPUs, the controlling CPU calls pause_cpus (which uses cpu_pause_start), where it iterates over all the pausing CPUs setting their safe_list entries to PAUSE_IDLE and queueing up (using setbackdq) their pause threads.

Now, just because the pause threads got queued doesn’t mean that they’ll get to execute immediately. That is why the controlling CPU then waits for each of the pause threads to up a semaphore in the cpu_pause_info structure. Once all the pause threads have upped the semaphore, the controlling CPU sets the cp_go flag to let the pause threads know that it’s time for them to go to sleep. Then the controlling CPU waits for each pause thread to signal (via the safe_list) that they have disabled just about all interrupts and that they are spinning (mach_cpu_pause). At this point, pause_cpus knows that all online CPUs are in a safe place.


Starting the CPUs back up is pretty easy. The controlling CPU just needs to set all the CPUs’ safe_list entries to PAUSE_IDLE. That will cause the pausing CPUs to break out of their spin-loop. Once out of the spin loop, interrupts are re-enabled and CPU control is relinquished (via swtch). The controlling CPU does some cleanup of its own, but that’s all there is to it.
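To get a feel for the handshake, here is a userland sketch of one pause/resume cycle. It is a model under loud assumptions, not the kernel code: pthreads stand in for the per-CPU pause threads, C11 atomics stand in for the kernel’s flags, an atomic counter stands in for the semaphore, and the ST_* values and function names are made up (only safe_list and cp_go are names from the real code).

```c
#include <pthread.h>
#include <stdatomic.h>

#define	NCPU	4				/* hypothetical number of pausing CPUs */
enum { ST_IDLE = 0, ST_PARKED = 1 };		/* stand-ins for the PAUSE_* states */

static atomic_int safe_list[NCPU];		/* per-CPU pause thread state */
static atomic_int checked_in;			/* stands in for the semaphore */
static atomic_int cp_go;			/* "time to park" flag */

static void *
pause_thread(void *arg)
{
	int cpu = *(int *)arg;

	atomic_fetch_add(&checked_in, 1);	/* "up" the semaphore */
	while (!atomic_load(&cp_go))
		;				/* wait for the go flag */
	atomic_store(&safe_list[cpu], ST_PARKED); /* report: now spinning */
	while (atomic_load(&safe_list[cpu]) != ST_IDLE)
		;				/* spin until released */
	return (NULL);
}

/* Runs one pause/resume cycle; returns how many threads were parked. */
int
pause_cycle(void)
{
	pthread_t tid[NCPU];
	int ids[NCPU];
	int parked = 0;
	int i;

	atomic_store(&checked_in, 0);
	atomic_store(&cp_go, 0);
	for (i = 0; i < NCPU; i++) {
		ids[i] = i;
		atomic_store(&safe_list[i], ST_IDLE);
		pthread_create(&tid[i], NULL, pause_thread, &ids[i]);
	}

	while (atomic_load(&checked_in) < NCPU)
		;				/* every pause thread is running */
	atomic_store(&cp_go, 1);

	for (i = 0; i < NCPU; i++) {
		while (atomic_load(&safe_list[i]) != ST_PARKED)
			;			/* wait for each one to park */
		parked++;
	}

	/* The CPU list could be modified safely at this point. */

	for (i = 0; i < NCPU; i++)
		atomic_store(&safe_list[i], ST_IDLE);	/* release */
	for (i = 0; i < NCPU; i++)
		(void) pthread_join(tid[i], NULL);

	return (parked);
}
```

The sketch leaves out everything that makes the real thing hard (priorities, interrupts, the dispatcher), but the two-phase shape is the same: first wait until every pause thread is actually running, then wait until every one of them reports that it is spinning.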


Why not use a mutex or semaphore for everything? The problem lies in the fact that we are in a really fragile state. We don’t want to lose the CPU because we blocked on a semaphore. That’s why this code uses custom synchronization primitives.

by JeffPC at December 29, 2013 06:45 PM

December 14, 2013

Josef "Jeff" Sipek

iSCSI boot - Success

In my previous post, I documented some steps necessary to get OpenIndiana to boot from iSCSI.

I finally managed to get it to work cleanly. So, here are the remaining details necessary to boot your OI box from iSCSI.


First, boot from one of the OI installation media. I used a USB flash drive. Then, before starting the installer, drop into a shell and connect to the target.

# iscsiadm add discovery-address
# iscsiadm modify discovery -t enable

At this point, you should have all the LUs accessible:

# format
Searching for disks...done

       0. c5t600144F000000000000052A4B4CE0002d0 <SUN-COMSTAR-1.0 cyl 13052 alt 2 hd 255 sec 63>
Specify disk (enter its number): 

Exit the shell and start the installer.

Now, the tricky part… When you get to the network configuration page, you must select the “None” option. Selecting “Automatically” will cause nwam to try to start on boot and it’ll step onto the already configured network interface. That’s it. Finish installation normally. Once you’re ready to reboot, either configure your network card or use iPXE as I’ve shared before.


For the curious, here’s what the iSCSI booted (from the e1000g NIC) system looks like:

# svcs network/physical
STATE          STIME    FMRI
disabled       17:13:10 svc:/network/physical:nwam
online         17:13:15 svc:/network/physical:default
# dladm show-link
e1000g0     phys      1500   up       --         --
# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
e1000g0/?         static   ok 
lo0/v4            static   ok 
lo0/v6            static   ok           ::1/128


Does switching back to the on-board nge NICs work now? No. We still get a lovely panic:

WARNING: Cannot plumb network device 19

panic[cpu0]/thread=fffffffffbc2f400: vfs_mountroot: cannot mount root

Warning - stack not written to the dump buffer
fffffffffbc71ae0 genunix:vfs_mountroot+75 ()
fffffffffbc71b10 genunix:main+136 ()
fffffffffbc71b20 unix:_locore_start+90 ()

by JeffPC at December 14, 2013 06:55 PM

December 08, 2013

Josef "Jeff" Sipek

iSCSI boot

I decided a couple of days ago to try to see if OpenIndiana would still fail to boot from iSCSI like it did about two years ago. This post exists to remind me later what I did. If you find it helpful, great.

First, I got to set up the target. There is a bunch of documentation on how to use COMSTAR to export a LU, so I won’t explain it here. I made a 100 GB LU.

I dug up an older system to act as my test box and disconnected its SATA disk. Booting from the OI USB image was uneventful. Before starting the installer, I dropped into a shell and connected to the target (using iscsiadm). Then I installed OI onto the LU. Then, I dropped back into the shell to modify Grub’s menu.lst so that both the Grub menu and the kernel’s console output go to the serial port.

Since the two on-board NICs can’t boot off iSCSI, I ended up using iPXE to boot off iSCSI. First, I made a script file:



Then it was time to grab the source and build it. I did run into a simple problem in a test file, so I patched it trivially.

$ git clone git://
$ cd ipxe
$ cat /tmp/ipxe.patch
diff --git a/src/tests/vsprintf_test.c b/src/tests/vsprintf_test.c
index 11512ec..2231574 100644
--- a/src/tests/vsprintf_test.c
+++ b/src/tests/vsprintf_test.c
@@ -66,7 +66,7 @@ static void vsprintf_test_exec ( void ) {
 	/* Basic format specifiers */
 	snprintf_ok ( 16, "%", "%%" );
 	snprintf_ok ( 16, "ABC", "%c%c%c", 'A', 'B', 'C' );
-	snprintf_ok ( 16, "abc", "%lc%lc%lc", L'a', L'b', L'c' );
+	//snprintf_ok ( 16, "abc", "%lc%lc%lc", L'a', L'b', L'c' );
 	snprintf_ok ( 16, "Hello world", "%s %s", "Hello", "world" );
 	snprintf_ok ( 16, "Goodbye world", "%ls %s", L"Goodbye", "world" );
 	snprintf_ok ( 16, "0x1234abcd", "%p", ( ( void * ) 0x1234abcd ) );
$ patch -p1 < /tmp/ipxe.patch
$ make bin/ipxe.usb EMBED=/tmp/ipxe.script
$ sudo dd if=bin/ipxe.usb of=/dev/rdsk/c8t0d0p0 bs=1M

Now, I had a USB flash drive with iPXE that’d get a DHCP lease and then proceed to boot from my iSCSI target.

Did the system boot? Partially. iPXE did everything right: DHCP, storing the iSCSI information in the iBFT, reading from the LU, and handing control over to Grub. Grub did the right thing too. Sadly, once in the kernel, things didn’t quite work out the way they should.


Was the iBFT getting parsed properly? After reading the code for a while and using mdb to examine the state, I found a convenient tunable (read: global int that can be set using the debugger) that will cause the iSCSI boot parameters to be dumped to the console. It is called iscsi_print_bootprop. Setting it to non-zero will produce nice output:

Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ unix krtld genunix ]
[0]> iscsi_print_bootprop/W 1
iscsi_print_bootprop:           0               =       0x1
[0]> :c
OpenIndiana Build oi_151a7 64-bit (illumos 13815:61cf2631639d)
SunOS Release 5.11 - Copyright 1983-2010 Oracle and/or its affiliates.
All rights reserved. Use is subject to license terms.
Initiator Name :
Local IP addr  :
Local gateway  :
Local DHCP     :
Local MAC      : 00:02:b3:a8:66:0c
Target Name    :
Target IP      :
Target Port    : 3260
Boot LUN       : 0000-0000-0000-0000

nge vs. e1000g

So, the iBFT was getting parsed properly. The only “error” message to indicate that something was wrong was the “Cannot plumb network device 19”. Searching the code reveals that this is in the rootconf function. After more tracing, it became apparent that the kernel was trying to set up the NIC but was failing to find a device with the MAC address iBFT indicated. (19 is ENODEV)

At this point, it dawned on me that the on-board NICs are mere nge devices. I popped in a PCI-X e1000g, moved the cable over, and rebooted. Things got a lot farther!

unable to connect

Currently, I’m looking at this output.

NOTICE: Configuring iSCSI boot session...
NOTICE: iscsi connection(5) unable to connect to target
Loading smf(5) service descriptions: 171/171
Hostname: oi-test
Configuring devices.
Loading smf(5) service descriptions: 6/6
NOTICE: iscsi connection(12) unable to connect to target

The odd thing is that while these messages appear, SMF is busy loading manifests, and tracing the iSCSI traffic to the target shows that the kernel is doing a bunch of reads and writes. I suspect that all the successful I/O is done over one connection and then something happens and we lose the link. This is where I am now.

by JeffPC at December 08, 2013 04:48 PM

December 01, 2013

Josef "Jeff" Sipek

Meili upgrades

A couple of months ago, I decided to update my almost two and a half year old laptop. Twice.

First, I got more RAM. This upped it to 12 GB. While still on the low side for a box which actually gets to see some heavy usage (compiling illumos takes a couple of hours and generates a couple of GB of binaries), it was better than the 4 GB I used for way too long.

Second, I decided to bite the bullet and replaced the 320 GB disk with a 256 GB SSD (Samsung 840 Pro). Sadly, in the process I had the pleasure of reinstalling the system — both Windows 7 and OpenIndiana. Overall, the installation was uneventful as my Windows partition has no user data and my OI storage is split into two pools (one for system and one for my data).

The nice thing about reinstalling OI was getting back to a stock OI setup. A while ago, I managed to play with software packaging a bit too much and before I knew it I was using a customized fork of OI that I had no intention of maintaining. Of course, I didn’t realize this until it was too late to roll back. Oops. (Specifically, I had a custom pkg build which was incompatible with all versions of OI ever released.)

One of the painful things about my messed-up-OI install was that I was running a debug build of illumos. This made some things pretty slow. One such thing was boot. The ZFS related pieces took about a minute alone to complete. The whole boot procedure took about 2.5 minutes. Currently, with a non-debug build and an SSD, my laptop goes from Grub prompt to gdm login in about 40 seconds. I realize that this is an apples to oranges comparison.

I knew SSDs were supposed to be blazing fast, but I resisted getting one for the longest time mostly due to reliability concerns. What changed my mind? I got to use a couple of SSDs in my workstation at work. I saw the performance and I figured that ZFS would take care of alerting me of any corruption. Since most of my work is version controlled, chances are that I wouldn’t lose anything. Lastly, SSDs got a fair amount of improvements over the past few years.

by JeffPC at December 01, 2013 01:30 AM

November 29, 2013

Josef "Jeff" Sipek

Biometrics


Last week I got to spend a bit of time in NYC with obiwan. He’d never been to New York, so he did the tourist thing. I got to tag along on Friday. We went to the Statue of Liberty, Ellis Island, and a pizza place.

You may have noticed that this post is titled “Biometrics,” so what’s NYC got to do with biometrics? Pretty simple. In order to get into the Statue of Liberty, you have to first surrender your bags to a locker and then you have to go through a metal detector. (This is the second time you go through a metal detector; the first is in Battery Park before you get on the boat to Liberty Island.) Once on Liberty Island, you go into a tent before the entrance where you get to leave your bags and $2. Among the maybe 500–600 lockers, there are two or three touch screen interfaces. You use these to rent a locker. After selecting the language you wish to communicate in and feeding in the money, a strobe light goes off, blinding you. This is to indicate where you are supposed to place your finger to have your fingerprint scanned. Your desire to rent a locker aside, you want to put your finger on the scanner to make the strobe go away. Anyway, once the system is happy it pops a random (unused) locker open and tells you to use it.

What could possibly go wrong.

After visiting the statue, we got back to the tent to liberate the bags. At the same touch screen interface, we entered in the locker number and when prompted scanned the correct finger. The fingerprint did not get recognized. After repeating the process about a dozen times, it was time to talk to the people running the place about the malfunction. The person asked for the locker number, went to the same interface that we used, used what looked like a one-wire key fob near the top of the device to get an admin interface and then unlocked the locker. That’s it. No verification of whether we actually owned the contents of the locker.

I suppose this is no different from a (physical) key operated locker for which you lost the key. The person in charge of renting the lockers has no way to verify your claim to the contents of the locker. Physical keys, however, are extremely durable compared to the rather finicky fingerprint scanners that won’t recognize you if you look at them the wrong way (or have oily or dirty fingers in a different way than they expect). My guess is that the reason the park service went with a fingerprint-based solution instead of a more traditional physical-key-based solution is simple: people can’t lose the locker keys if there aren’t any. Now, are cheap fingerprint readers accurate enough to not malfunction like this often? Are the people supervising the locker system generally this apathetic about opening a locker without any questions? I do not know, but my observations so far are not very positive.

I suspect more expensive fingerprint readers will perform better. It just doesn’t make sense for something as cheap as a locker to use the more expensive readers.

by JeffPC at November 29, 2013 11:33 PM

November 26, 2013

Nate Berry

Increase disk size of Ubuntu guest running in VMware

A while ago I created a virtual machine (VM) under VMware 5.1 with Ubuntu as the guest OS. I wasn’t giving the task my full attention and I made a couple choices without thinking when setting this one up. The problem I ended up with is that I only allocated about 10GB to the VM […]

by Nate at November 26, 2013 03:25 AM

November 04, 2013

Eitan Adler

Two Factor Authentication for SSH (with Google Authenticator)

Two factor authentication is a method of ensuring that a user has a physical device in addition to their password when logging in to some service. This works by using a time (or counter) based code which is generated by the device and checked by the host machine. Google provides a service which allows one to use their phone as the physical device using a simple app.

This service can be easily configured and greatly increases the security of your host.
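
For the curious, the two pieces of arithmetic behind a time-based code can be sketched in a few lines of C. This assumes the RFC 6238 and RFC 4226 scheme with the usual 30-second step and 6 digits; the HMAC-SHA1 computation itself is left out, and the function names totp_counter and hotp_truncate are made up for illustration.

```c
#include <stdint.h>

/* RFC 6238: the moving factor is the number of whole `step`-second
 * intervals since the Unix epoch (step is usually 30). */
uint64_t totp_counter(uint64_t unix_time, uint64_t step)
{
    return unix_time / step;
}

/* RFC 4226 "dynamic truncation": turn a 20-byte HMAC-SHA1 result into
 * a short decimal code. `digits` is typically 6 and must be below 10. */
uint32_t hotp_truncate(const uint8_t hmac[20], unsigned digits)
{
    unsigned offset = hmac[19] & 0x0f;  /* low 4 bits pick the window */
    uint32_t bin = ((uint32_t)(hmac[offset] & 0x7f) << 24) |
                   ((uint32_t)hmac[offset + 1] << 16) |
                   ((uint32_t)hmac[offset + 2] << 8) |
                   (uint32_t)hmac[offset + 3];
    uint32_t mod = 1;

    while (digits-- > 0)
        mod *= 10;
    return bin % mod;
}
```

The phone and the host both compute HMAC-SHA1 over the counter using the shared secret and then truncate the result this way, so the codes match as long as the two clocks roughly agree.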

Installing Dependencies

  1. There is only one: the Google-Authenticator software itself:
    # pkg install pam_google_authenticator
  2. On older FreeBSD installations you may use:
    # pkg_add -r pam_google_authenticator
    On Debian derived systems use:
    # apt-get install libpam-google-authenticator

User configuration

Each user must run "google-authenticator" once prior to being able to log in with ssh. This will be followed by a series of yes/no prompts which are fairly self-explanatory. Note that the alternative to time-based codes is a counter. It is easy to lose track of which number you are at, so most people prefer time-based.
  1. $ google-authenticator
    Do you want authentication tokens to be time-based (y/n)
    Make sure to save the URL or secret key generated here as it will be required later.

Host Configuration

To enable use of Authenticator the host must be set up to use PAM which must be configured to prompt for Authenticator.
  1. Edit the file /etc/pam.d/sshd and add the following in the "auth" section prior to pam_unix:
    auth requisite
  2. Edit /etc/ssh/sshd_config and uncomment
    ChallengeResponseAuthentication yes

Reload ssh config

  1. Finally, the ssh server needs to reload its configuration:
    # service sshd reload

Configure the device

  1. Follow the instructions provided by Google to install the authentication app and set up the phone.

That is it. Try logging into your machine from a remote machine now.

Thanks bcallah for proof-reading this post.

by Eitan Adler at November 04, 2013 12:56 AM

October 21, 2013

Josef "Jeff" Sipek

Private Pilot, Honeymooning, etc.

Early September was a pretty busy time for me. First, I got my private pilot certificate. Then, three days later, Holly and I got married. We used this as an excuse to take four weeks off and have a nice long honeymoon in Europe (mostly in Prague).

Our flight to Prague (LKPR) had a layover at KJFK. While waiting at the gate at KDTW, I decided to talk to the pilots. They said I should stop by and say hi after we land at JFK. So I did. Holly tagged along.

A little jealous about the left seat

I am impressed with the types of displays they use. Even with direct sunlight you can easily read them.

After about a week in Prague, we rented a plane (a 1982 Cessna 172P) with an instructor and flew around the Czech Republic looking at the castles.


I did all the flying, but I let the instructor do all the radio work, and since he was way more familiar with the area he ended up acting sort of like a tour guide. Holly sat behind me and had a blast with the cameras. The flight took us over Bezděz, Ještěd, Bohemian Paradise, and Jičín where we stopped for tea. Then we took off again, and headed south over Konopiště, Karlštejn, and Křivoklát. Overall, I logged 3.1 hours in European airspace.

by JeffPC at October 21, 2013 03:27 PM

October 05, 2013


Debian GNU / Linux on Samsung ATIV Book 9 Plus

Samsung just recently released a new piece of kit, the ATIV Book 9 Plus. It's their top-of-the-line Ultrabook. Being in the market for a new laptop, I was hooked when I heard the specs. Sure, it doesn't have the best CPU in a laptop or even an amazing amount of RAM; in that regard it's kind of run of the mill. But that was enough for me. The really amazing thing is the screen, with 3200x1800 resolution and 275 DPI. If you were to get a standalone monitor with similar resolution, you'd be forking over anywhere from 50-200% of the value of the ATIV Book 9 Plus. Anyway, this is not a marketing pitch. As a GNU / Linux user, buying bleeding-edge hardware can be a bit intimidating. The problem is that it's not clear if the hardware will work without too much fuss. I couldn't find any reports of folks running GNU / Linux on it, but decided to order one anyway.

My distro of choice is Debian GNU / Linux. So when the machine arrived, the first thing I did was try Debian Live. It did take some tinkering with the BIOS (press F2 on boot to enter config) to get it to boot, mostly because the BIOS UI is horrendous. In the end, disabling secure boot was pretty much all it took. Out of the box, most things worked, the exceptions being Wi-Fi and brightness control. At this point I was more or less convinced that getting GNU / Linux running on it would not be too hard.

I proceeded to install Debian from the stable net-boot CD. At first, with UEFI enabled but secure boot disabled, installation went fine, but when it came time to boot the machine, it would simply not work. It looked like the boot loader wasn't starting properly. I didn't care too much about UEFI, so I disabled it completely and re-installed Debian. This time things worked and Debian Stable booted up. I tweaked /etc/apt/sources.list, switching from Stable to Testing. I rebooted the machine and noticed that on boot the screen went black. It was rather obvious that the problem was with KMS. Likely the root of the problem was the new kernel (linux-image-3.10-3-amd64), which got pulled in during the upgrade to Testing. The short-term workaround is simple: disable KMS (add nomodeset to the kernel boot line in grub).

So now I had a booting base system, but there was still the problem of Wi-Fi and KMS. I installed the latest firmware-iwlwifi, which had the required firmware for the Intel Corporation Wireless 7260. However, Wi-Fi still did not work. Fortunately, I came across this post on the Arch Linux wiki which states that the Wi-Fi card is only supported in Linux kernel >=3.11.

After an hour or so of tinkering with kernel configs I got the latest kernel (3.11.3) to boot with working KMS and Wi-Fi. Long story short, until Debian moves to kernel >=3.11 you'll need to compile your own or install my custom compiled package. With the latest kernel pretty much everything works on this machine, including the things that are often tricky, like suspend, backlight control, touchscreen, and obviously Wi-Fi. The only remaining things to figure out are the volume and keyboard backlight control keys. But for now I'm making do with a software sound mixer. And the keyboard backlight can be adjusted with (values: 0-4):

echo "4" > /sys/class/leds/samsung\:\:kbd_backlight/brightness

So if you are looking to get a Samsung ATIV Book 9 and wondering if it'll play nice with GNU / Linux, the answer is yes.

by dotCOMmie at October 05, 2013 08:11 PM

August 28, 2013

Josef "Jeff" Sipek

Optimizing for Failure

For the past two years, I’ve been working at Barracuda Networks on a key-value storage system called Moebius. As with any other software project, the development was more focused on stability and basic functionality at first. However lately, we managed to get some spare cycles to consider tackling some of the big features we’ve been wishing for as well as revisiting some of the initial decisions. This includes error handling — specifically how and what size of hardware failures should be handled. During this brainstorming, I made an interesting (in my opinion) observation regarding optimizing systems.

If you take any computer architecture or organization course, you will hear about Amdahl’s law. Even if you never took an architecture course or just never heard of Amdahl, eventually you came to the realization that one should optimize for the common case. (Technically, Amdahl’s law is about parallel speedup but the idea of an upper bound on performance improvement applies here as well.) A couple of years ago, when I used to spend more time around architecture people, a day wouldn’t go by when I didn’t hear them focus on making the common case fast and the uncommon case correct, as well as always guaranteeing forward progress.
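
As a reminder of what that upper bound looks like, here is Amdahl’s law in its usual textbook form as a tiny C sketch. The function names are made up for illustration; p is the fraction of the running time that benefits from the improvement and s is how much faster that fraction becomes.

```c
/* Overall speedup when a fraction p (0 <= p < 1) of the running time
 * is made s times faster and the rest is left untouched. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on the overall speedup as s grows without limit. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

For example, making 90% of the work 10 times faster gives an overall speedup of only about 5.3, and no amount of extra effort on that 90% can push it past 10. The untouched remainder, the uncommon case, sets the ceiling.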

My realization is that straightforward optimization for the common case is not sufficient. I’m not claiming that my realization is novel in any way. Simply that it surprised me more than it should.

Suppose you are writing a storage system. The common case (all hardware and software operate correctly) has been optimized and the whole storage system is performing great. Now, suppose that a hardware failure (or even a bug in other software!) occurs. Since this is a rare occurrence, you did not optimize for it. The system is still operating, but you want to take some corrective action. Sadly, the failure has caused the system to no longer operate under the common case. So, you have a degraded system whose performance is hindering your corrective action! Ouch!

The answer is to optimize not just for the common case, but for some uncommon cases. Which uncommon cases? Well, the most common ones. :) The problem in the above scenario could have been (hopefully) avoided by not just optimizing for the common case, but also optimizing for the common failure! This is the weird bit… optimize for failures because you will see them.

In the case of a storage system, some failures to consider include:

  • one or more disks failing
  • random bit flips on one or more disks
  • one or more disks responding slowly
  • one or more disks temporarily disappearing and shortly after reappearing
  • low memory conditions

This list is far from exhaustive. You may even decide that some of these failures are outside the scope of your storage system’s reliability guarantees. But no matter what you decide, you need to keep in mind that your system will see failures and it must still behave well enough to not be a hindrance.

None of what I have written here is ground breaking. I just found it sufficiently different from what one normally hears that I thought I would write it up. Sorry architecture friends, the uncommon case needs to be fast too :)

by JeffPC at August 28, 2013 04:06 PM

August 03, 2013

John Lutz

a theoretical p2p dynamic messaging and/or voting system which is open sourced for the people.

  The standard model of internet activity is client/server: one server answering many clients. Another paradigm, which is used much less often, is Peer to Peer (p2p).

  Peer to Peer allows each node on the internet to serve as well as receive as a client. This allows for a self-administering and self-correcting design. But what is most useful and trusted is that control is decentralized. This is good for many reasons: uptime improves and abuse of power is *greatly* diminished.

  Some common examples of p2p in action are Bitcoin, Tor and BitTorrent. All of these have Open Source implementations. Open Source gives users the option to compile the software themselves, which is critical to keeping control in the hands of every user involved. It also opens those black boxes called 'Apps', 'Programs', 'Systems' or 'Applications' to peer review so that nothing suspicious happens without you knowing (for example, Bluetooth and web cams automatically set to record and run as the default behaviour). [I have band aids covering all my personal laptop cameras.]

  There are many services in the systems provided by Apple and, most notably, Windows that prevent us from knowing what traffic is transmitted from our personal technological devices. Open Source allows us not only to protect our personal information but also to extend and share what we've done with those we see fit. Open Source is equated here with Power to The People. And services such as support, administration or development can still use the monetary model. It all depends on each individual situation. The possibilities of Open Source are indeed amazing.

  There are many forms of Open Source software, but none so well known worldwide as an operating system called Linux. There are, in fact, many variations of Linux modified by individuals and groups. I have used and administered Debian, Ubuntu, Red Hat, CentOS and SuSE with mixed success. Depending on which of these, or of the many other publicly downloadable variations, you pick, it can take as little as 25 minutes or as long as 3 days to fully install and configure the software and typical internet apps. You can literally change the source code (if you have a little swagger) to change their default behaviour.

  A p2p system in which a bulletin board (a variation of a standard non-p2p system such as phpBB or vBulletin) delivers messages with time stamps and backups to other nodes could theoretically be created along the lines of how famous p2p systems like Bitcoin operate. Except in the case of this theoretical framework, instead of virtual currency it would carry messages, votes, blogs. Any kind of data. This framework would be a nonstop behemoth using potentially hundreds of thousands or even millions of clients, which themselves also act as servers. Being free from a centralized control system that changes according to the whims of the few is vital to producing, like all good policies, a system of checks and balances free from tyranny.

 I would suggest that if ISPs started to ban protocol ports (any port N, where N is 1 to 65535), as they have for BitTorrent, a programmer could creatively reprogram this new theoretical p2p message system to alternate between different ports dynamically. That way, even a very powerful ISP could not ban the people's p2p messaging system the way mine banned BitTorrent.

 I hope you have fully understood what I have written here. If you have any more questions or would like more depth on what I've presented here, please let me know at @john_t_lutz on Twitter. Or here in this blog. Thank you.

Worldly Yours,
John Lutz

by JohnnyL at August 03, 2013 07:22 PM

July 16, 2013

Josef "Jeff" Sipek


I just found out about nftw — a libc function to walk a file tree. I did not realize that libc had such a high-level function. I bet it’ll end up saving me time (and code) at some point in the future.

int nftw(const char *path, int (*fn) (const char *, const struct stat *,
				      int, struct FTW *), int depth,
	 int flags);

Given a path, it executes the callback for each file and directory in that tree. Very cool.

Ok, as I write this post, I am told that nftw is a great way to write dangerous code. Aside from easily writing dangerous things equivalent to rm -rf, I could see not specifying FTW_PHYS being dangerous, as symlinks will get followed without any notification.

I guess I’ll play around with it a little to see what the right way (if any?) to use it is.
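
Here is a small sketch of what a cautious use might look like: counting the regular files under a directory while passing FTW_PHYS so that symlinks are reported but not followed. The names count_cb and count_files are made up, the depth of 16 is an arbitrary choice for the number of directories nftw may keep open at once, and the global counter means this is not thread-safe.

```c
#define _XOPEN_SOURCE 500  /* must come before the includes to expose nftw() */
#include <ftw.h>
#include <sys/stat.h>

static long nfiles;  /* regular files seen by the current walk */

/* Called by nftw() once for every object in the tree. */
static int count_cb(const char *path, const struct stat *sb, int type,
                    struct FTW *ftw)
{
    (void)path;
    (void)sb;
    (void)ftw;
    if (type == FTW_F)  /* a regular file */
        nfiles++;
    return 0;           /* returning non-zero would stop the walk */
}

/* Returns the number of regular files under root, or -1 on error. */
long count_files(const char *root)
{
    nfiles = 0;
    /* FTW_PHYS: report symlinks as FTW_SL instead of following them. */
    if (nftw(root, count_cb, 16, FTW_PHYS) == -1)
        return -1;
    return nfiles;
}
```

The callback only reads, which sidesteps the rm -rf class of accidents entirely. Anything destructive would also want FTW_DEPTH so that a directory’s contents are visited before the directory itself.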

by JeffPC at July 16, 2013 04:14 PM

June 29, 2013

Josef "Jeff" Sipek


After several years of having a desktop at home that’s been unplugged and unused I decided that it was time to make a home server to do some of my development on and just to keep files stored safely and redundantly. This was in August 2011. A lot has happened since then. First of all, I rebuilt the OpenIndiana (an Illumos-based distribution) setup with SmartOS (another Illumos-based distribution). Since I wrote most of this a long time ago, some of the information below is obsolete. I am sharing it anyway since others may find it useful. Toward the end of the post, I’ll go over the SmartOS rebuild. As you may have guessed, the hostname for this box ended up being Isis.

First of all, I should list my goals.

storage box
The obvious mix of digital photos, source code repositories, assorted documents, and email backup is easy enough to store. It however becomes a nightmare if you need to keep track of where they are (i.e., which of the two external disks, public server (Odin), laptop drives, desktop drives they are on). Since none of them are explicitly public, it makes sense to keep them near home instead of on my public server that’s in a data-center with a fairly slow uplink (1 Mbit/s burstable to 10 Mbit/s, billed at 95th percentile).
dev box
I have a fast enough laptop (Thinkpad T520), but a beefier system that I can let compile large amounts of code is always nice. It will also let me run several virtual machines and zones comfortably — for development, system administration experiments, and other fun stuff.
router
I have an old Linksys WRT54G (rev. 3) that has served me well over the years. Sadly, it is getting a bit in my way: IPv6 tunneling over IPv4 is difficult, the 100 Mbit/s switch makes it harder to transfer files between computers, etc. If I am making a server that will be always on, it should effortlessly handle NAT’ing my Comcast internet connection. Having a full-fledged server doing the routing will also let me do better traffic shaping & filtering to make the connection feel better.

Now that you know what sort of goals I have, let’s take a closer look at the requirements for the hardware.

  1. reliable
  2. friendly to OpenIndiana and ZFS
  3. low-power
  4. fast
  5. virtualization assists (to support running virtual machines at reasonable speed)
  6. cheap
  7. quiet
  8. spacious (storage-wise)

While each one of them is pretty easy to accomplish, their combination is much harder to achieve. Also note that the list is ordered from most to least important. As you will see, reliability dictated many of my choices.

The Shopping List

CPU
Intel Xeon E3-1230 Sandy Bridge 3.2GHz LGA 1155 80W Quad-Core Server Processor BX80623E31230
RAM (4)
Kingston ValueRAM 4GB 240-Pin DDR3 SDRAM DDR3 1333 ECC Unbuffered Server Memory Model KVR1333D3E9S/4G
Motherboard
SUPERMICRO MBD-X9SCL-O LGA 1155 Intel C202 Micro ATX Intel Xeon E3 Server Motherboard
Case
SUPERMICRO CSE-743T-500B Black Pedestal Server Case
Data Drives (3)
Seagate Barracuda Green ST2000DL003 2TB 5900 RPM SATA 6.0Gb/s 3.5"
System Drives (2)
Western Digital WD1600BEVT 160 GB 5400RPM SATA 8 MB 2.5-Inch Notebook Hard Drive
Additional NIC
Intel EXPI9301CT 10/100/1000Mbps PCI-Express Desktop Adapter Gigabit CT

To measure the power utilization, I got a P3 International P4400 Kill A Watt Electricity Usage Monitor. All my power usage numbers are based on watching the digital display.

Intel vs. AMD

I’ve read Constantin’s OpenSolaris ZFS Home Server Reference Design and I couldn’t help but agree that ECC should be a standard feature on all processors. Constantin pointed out that many more AMD processors support ECC and that as long as you got a motherboard that supported it as well you are set. I started looking around at AMD processors but my search was derailed by Joyent’s announcement that they ported KVM to Illumos — the core of OpenIndiana including the kernel. Unfortunately for AMD, this port supports only Intel CPUs. I switched gears and started looking at Intel CPUs.

In a way I wish I had a better reason for choosing Intel over AMD but that’s the truth. I didn’t want to wait for AMD’s processors to be supported by the KVM port.

So, why did I get a 3.2GHz Xeon (E3-1230)? I actually started by looking for motherboards. At first, I looked at desktop (read: cheap) motherboards. Sadly, none of the Intel-based boards I’ve seen supported ECC memory. Looking at server-class boards made the search for ECC support trivial. I was surprised to find a Supermicro motherboard (MBD-X9SCL-O) for $160. It supports up to 32 GB of ECC RAM (4x 8 GB DIMMs). Rather cheap, ECC memory, dual gigabit LAN (even though one of the LAN ports uses the Intel 82579 which was unsupported by OpenIndiana at the time), 6 SATA II ports: a nice board by any standard. This motherboard uses the LGA 1155 socket. That more or less means that I was “stuck” with getting a Sandy Bridge processor. :-D The E3-1230 is one of the slower E3 series processors, but it is still very fast compared to most of the other processors in the same price range. Additionally, it’s “only” an 80 Watt chip compared to many 95 or even 130 Watt chips from the previous series.

There you have it. The processor was more or less determined by the motherboard choice. Well, that’s being rather unfair. It just ended up being a good combination of processor and motherboard — a cheap server board and near-bottom-of-the-line processor that happens to be really sweet.

Now that I had a processor and a motherboard picked out, it was time to get RAM. In the past, I’ve had good luck with Kingston, and since its 4 GB ECC DIMMs happened to be the cheapest on NewEgg, I got 4, for a grand total of 16 GB.


I will let you know a secret. I love hotswap drive bays. They just make your life easier — from being able to lift a case up high to put it on a shelf without having to lift all those heavy drives at the same time, to quickly replacing a dead drive without taking the whole system down.

I like my public server’s case (Supermicro CSE-743T-645B) but the 645 Watt power supply is really an overkill for my needs. The four 5000 RPM fans on the midplane are pretty loud when they go full speed. I looked around, and I found a 500 Watt (80%+ efficiency) variant of the case (CSE-743-500B). Still a beefy power supply but closer to what one sees in high end desktops. With this case, I get eight 3.5" hot-swap bays, and three 5.25" external (non-hotswap) bays. This case shouldn’t be a limiting factor in any way.

I intended to move my DVD+RW drive from my desktop but that didn’t work out as well as I hoped.


At the time I was constructing Isis, I was experimenting with ZFS on OpenIndiana. I was more than impressed, and I wanted it to manage the storage on my home server. ZFS is more than just a filesystem; it is also a volume manager. In other words, you can give it multiple disks and tell it to put your data on them in several different ways that closely resemble RAID levels. It can stripe, mirror, or calculate one to three parities. Wikipedia has a nice article outlining ZFS’s features. Anyway, I strongly support ZFS’s attitude toward losing data — do everything to prevent it in the first place.

Hard drives are very interesting devices. Their reliability varies with so many variables (e.g., manufacturing defects, firmware bugs). In general, manufacturers give you fairly meaningless-looking, yet impressive-sounding numbers about their drives’ reliability. Richard Elling made a great blog post where he analyzed ZFS RAID space versus Mean-Time-To-Data-Loss, or MTTDL for short. (Later, he analyzed a different MTTDL model.)

The short version of the story is nicely summed up by this graph (taken from Richard’s blog):

While this scatter plot is for a specific model of a high-end server, it applies to storage in general. I like how the various types of redundancy clump up.

Anyway, how much do I care about my files? Most of my code lives in distributed version control systems, so losing one machine wouldn’t be a problem for those. The other files would be a bigger problem. While it wouldn’t be a complete end of the world if I lost all my photos, I’d rather not lose them. This goes back to the requirements list — I prefer reliable over spacious. That’s why I went with a 3-way mirror of 2 TB Seagate Barracuda Green drives. It gets me only 2 TB of usable space, but at the same time I should be able to keep my files forever. These are the data drives. I also got two 2.5" 160 GB Western Digital laptop drives to hold the system files — mirrored of course.

Around the same time I was discovering that the only sane way to keep your files was mirroring, I stumbled across Constantin’s RAID Greed post. He basically says the same thing — use a 3-way mirror and your files will be happy.

Now, you might be asking… 2 TB, that’s not a lot of space. What if you outgrow it? My answer is simple: ZFS handles that for me. I can easily buy three more drives, plug them in and add them as a second 3-way mirror and ZFS will happily stripe across the two mirrors. I considered buying 6 disks right away, but realized that it’ll probably be at least 6-9 months before I’ll have more than 2 TB of data. So, if I postpone the purchase of the 3 additional drives, I can save money. It turns out that a year and a half later, I’m still below 70% of the 2 TB.
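The capacity arithmetic behind that plan can be sketched in a few lines of Python. This is only a rough model: the function name and the list-of-lists layout are made up for illustration, and ZFS metadata overhead is ignored. Each mirror holds as much as its smallest disk, and a pool striped across several mirrors holds their sum.

```python
def usable_tb(mirrors):
    """Approximate usable space of a pool made of mirror vdevs.

    Each mirror stores one full copy of the data per disk, so it holds
    only as much as its smallest member. The pool stripes across the
    mirrors, so their capacities add up. Metadata overhead is ignored.
    """
    return sum(min(disks) for disks in mirrors)


one_mirror = [[2, 2, 2]]               # the current 3-way mirror of 2 TB disks
two_mirrors = [[2, 2, 2], [2, 2, 2]]   # after adding a second 3-way mirror

print(usable_tb(one_mirror))    # 2
print(usable_tb(two_mirrors))   # 4
```

Six disks bought up front would have given 4 TB from day one; buying three gives 2 TB now and leaves the other 2 TB for whenever it is actually needed.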


I knew that one of the on-board LAN ports was not yet supported by Illumos, and so I threw a PCI-e Gigabit ethernet card into the shopping cart. I went with an Intel gigabit card. Illumos has since gained support for 82579-based NICs, but I’m lazy and so I’m still using the PCI-e NIC.

Base System

As the ordered components started showing up, I started assembling them. Thankfully, the CPU, RAM, motherboard, and case showed up at the same time preventing me from going crazy. The CPU came with a stock Intel heatsink.

The system started up fine. I went into the BIOS and did the usual new-system tweaking — make sure SATA ports are in AHCI mode, stagger the disk spinup to prevent unnecessary load peaks at boot, change the boot order to skip PXE, etc. While roaming around the menu options, I discovered that the motherboard can boot from iSCSI. Pretty neat, but useless for me on this system.

The BIOS has a menu screen that displays the fan speeds and the system and processor temperatures. With the fan on the heatsink and only one midplane fan connected the system ran at about 1°C higher than room temperature and the CPU was about 7°C higher than room temperature.

OS Installation

Anyway, it was time to install OpenIndiana. I put my desktop’s DVD+RW in the case and then realized that the motherboard doesn’t have any IDE ports! Oh well, time to use a USB flash drive instead. At this point, I had only the 2 system drives. I connected one to the first SATA port, put a 151 development snapshot (text installer) on my only USB flash drive. The installer booted just fine. Installation was uneventful. The one potentially out of the ordinary thing I did was to not configure any networking. Instead, I set it up manually after the first boot, but more about that later.

With OI installed on one disk, it was time to set up the rpool mirror. I used Constantin’s Mirroring Your ZFS Root Pool as the general guide even though it is pretty straightforward — duplicate the partition (and slice) scheme on the second disk, add the new slice to the root pool, and then install grub on it. Everything worked out nicely.

# zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h5m with 0 errors on Sun Sep 18 14:15:24 2011

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c2t0d0s0  ONLINE       0     0     0
            c2t1d0s0  ONLINE       0     0     0

errors: No known data errors


Since I wanted this box to act as a router, the network setup was a bit more…complicated (and quite possibly over-engineered). This is why I elected to do all the network setup by hand later rather than having to “fix” whatever damage the installer did. :)

I powered it off, put in the extra ethernet card I got, and powered it back on. To my surprise, the new device didn’t show up in dladm. I remembered that I should trigger the device reconfiguration. A short touch /reconfigure && reboot later, dladm listed two physical NICs.

network diagram

As you can see, I decided that the routing should be done in a zone. This way, all the routing settings are nicely contained in a single place that does nothing else.

Setting up the virtual interfaces was pretty easy thanks to dladm. Setting the static IP on the global zone was equally trivial.

# dladm create-vlan -l e1000g0 -v 11 vlan11
# dladm create-vnic -l e1000g0 vlan0
# dladm create-vnic -l e1000g0 internal0
# dladm create-vnic -l e1000g1 isp0
# dladm create-etherstub zoneswitch0
# dladm create-vnic -l zoneswitch0 zone_router0

# ipadm create-if internal0
# ipadm create-addr -T static -a local= internal/v4

You might be wondering about the vlan11 interface that’s on a separate VLAN. The idea was to have my WRT54G continue serving as a wifi access point, but have all the traffic end up on VLAN #11. The router zone would then get to decide whether the user is worthy of LAN or Internet access. I never finished poking around the WRT54G to figure out how to have it dump everything on VLAN #11 instead of the default #0.

Router zone

OpenSolaris (and therefore all Illumos derivatives) has a wonderful feature called zones. It is essentially a super-lightweight virtualization mechanism. While talking to a couple of people on IRC, I decided that I, like them, would use a dedicated zone as a router.

Just before I set up the router zone, the storage disks arrived. The router zone ended up being stored on this array. See the storage section below for details about this storage pool.

After installing the zone via zonecfg and zoneadm, it was time to set up the routing and firewalling. First, install the ipfilter package (pkg install pkg:/network/ipfilter). Now, it is time to configure the NAT and filter rules.

NAT is easy to set up. Just plop a couple of lines into /etc/ipf/ipnat.conf:

map isp0 -> 0/32 proxy port ftp ftp/tcp
map isp0 -> 0/32 portmap tcp/udp auto
map isp0 -> 0/32

map isp0 -> 0/32 proxy port ftp ftp/tcp
map isp0 -> 0/32 portmap tcp/udp auto
map isp0 -> 0/32

map isp0 -> 0/32 proxy port ftp ftp/tcp
map isp0 -> 0/32 portmap tcp/udp auto
map isp0 -> 0/32

IPFilter is a bit trickier to set up. The rules need to handle more cases. In general, I tried to be a bit paranoid about the rules. For example, I drop all traffic for IP addresses that don’t belong on that interface (I should never see addresses on my ISP interface). The only snag was in the defaults for the ipfilter SMF service. By default, it expects you to put your rules into SMF properties. I wanted to use the more old-school approach of using a config file. Thankfully, I quickly found a blog post which helped me with it.

Storage, part 2

As the list of components implies, I wanted to make two arrays. I already mentioned the rpool mirror. Once the three 2 TB disks arrived, I hooked them up and created a 3-way mirror (zpool create storage mirror c2t3d0 c2t4d0 c2t5d0).

# zpool status storage
  pool: storage
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Sep 18 14:10:22 2011

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0

errors: No known data errors

Deduplication & Compression

I suspected that there would be enough files that would be stored several times — system binaries for zones, clones of source trees, etc. ZFS has built-in online deduplication. This stores each unique block only once. It’s easy enough to turn on: zfs set dedup=on storage.

Additionally, ZFS has transparent data (and metadata) compression featuring the LZJB and gzip algorithms.

I enabled dedup and kept compression off. Dedup did take care of the duplicate binaries between all the zones. It even took care of duplicates in my photo stash. (At some point, I managed to end up with several diverged copies of my photo stash. One of the first things I did with Isis, was to dump all of them in the same place and start sorting them. Adobe Lightroom helped here quite a bit.)

After a while, I came to the realization that for most workloads I run, dedup was wasteful and I would be better off disabling dedup and enabling light compression (i.e., LZJB).


The installer puts the non-privileged user’s home directory onto the root pool. I did not want to keep it there since I now had the storage pool. After a bit of thought, I decided to zfs create storage/home and then transfer over the current home directory. I could have used cp(1) or rsync(1), but I thought it would be more fun (and a learning experience) to use zfs send and zfs recv. It went something like this:

# zfs snapshot rpool/export/home/jeffpc@snap
# zfs send rpool/export/home/jeffpc@snap | zfs recv storage/home/jeffpc

In theory, any modifications to my home directory after the snapshot got lost, but since I was just ssh’d in there wasn’t much that changed. (I am ok with losing the last update to .bash_history this one time.) The last thing that needed changing is /etc/auto_home — which tells the automounter where my $HOME really is. This is the resulting file after the change (without the copyright comment):

jeffpc	localhost:/storage/home/&

For good measure, I rebooted to make sure things would come up properly — they did.

Since the server is not intended just for me, I created the other user account with a home directory in storage/home/holly.


I intend to use zones extensively. To keep their files out of the way, I decided on storage/zones/$ZONE_NAME. I’ll talk more about the zones I set up later in the Zones section.


Local storage is great, but there is only so much you can do with it. Sooner or later, you will want to access it from a different computer. There are many different ways to “export” your data, but as one might expect, they all have their benefits and drawbacks. ZFS makes it really easy to export data via NFS and CIFS. After a lot of thought, I decided that CIFS would work a bit better. The major benefit of CIFS over NFS is that it Just Works™ on all the major operating systems. That’s not to say that NFS does not work, but rather that it needs a bit more…convincing at times. This is especially true on Windows.

I followed the documentation for enabling CIFS on Solaris 11. Yes, I know, OpenIndiana isn’t Solaris 11, but this aspect was the same. This ended with me enabling sharing of several datasets like this:

# zfs set sharesmb=name=photos storage/photos


The home directory shares are all done. The photos share, however, needs a bit more work. Specifically, it should be fully accessible to the users that are supposed to have access (i.e., jeffpc & holly). The easiest way I can find is to use ZFS ACLs.

First, I set the aclmode to passthrough (zfs set aclmode=passthrough storage). This will prevent a chmod(1) on a file or directory from blowing away all the ACEs (Access Control Entries?). Then on the share directory, I added two ACL entries that allow everything.

# /usr/bin/ls -dV /share/photos
drwxr-xr-x   2 jeffpc   root           4 Sep 23 09:12 /share/photos
# /usr/bin/chmod A+user:jeffpc:rwxpdDaARWcCos:fd:allow /share/photos
# /usr/bin/chmod A+user:holly:rwxpdDaARWcCos:fd:allow /share/photos
# /usr/bin/chmod A2- /share/photos # get rid of user
# /usr/bin/chmod A2- /share/photos # get rid of group
# /usr/bin/chmod A2- /share/photos # get rid of everyone
# /usr/bin/ls -dV /share/photos
drwx------+  2 jeffpc   root           4 Sep 23 09:12 /share/photos

The first two chmod commands prepend two ACEs. The next three remove ACE number 2 (the third entry). Since the directory started off with three ACEs (representing the standard Unix permissions), the second set of chmods removes those, leaving only the two user ACEs behind.
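Why removing index 2 three times in a row works is easier to see with the ACL modeled as a plain list. The following Python sketch is only an illustration: the helper names are made up, and the entry strings are shorthand for the real ACEs rather than actual chmod syntax.

```python
# Model the ACL as a list of entries, index 0 first.
# The three starting entries stand for the standard Unix permission bits.
acl = ["owner@", "group@", "everyone@"]


def prepend(acl, ace):
    """Like chmod A+<ace>: the new entry goes to the front (index 0)."""
    return [ace] + acl


def remove_at(acl, index):
    """Like chmod A<index>-: drop the entry at that index."""
    return acl[:index] + acl[index + 1:]


acl = prepend(acl, "user:jeffpc")
acl = prepend(acl, "user:holly")
# Now: user:holly, user:jeffpc, owner@, group@, everyone@

for _ in range(3):
    # Each removal shifts the next leftover entry into slot 2.
    acl = remove_at(acl, 2)

print(acl)   # ['user:holly', 'user:jeffpc']
```

The two user entries occupy slots 0 and 1, so slot 2 always holds whichever of the original three entries is next in line.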


That was easy! In case you are wondering, the Solaris/Illumos CIFS service does not allow guest access. You must login to use any of the shares.

Anyway, here’s the end result:

Pretty neat, eh?


Aside from the router zone, there were a number of other zones. Most of them were for Illumos and OpenIndiana development.

I don’t remember much of the details since this predates the SmartOS conversion.


When I first measured the system, it was drawing about 40-45 Watts while idle. Now, I have Isis along with the WRT54G and a gigabit switch on a UPS that tells me that I’m using about 60 Watts when idle. The load can spike up quite a bit if I put load on the 4 Xeon cores and give the disks something to do. (After all, it is an 80 Watt CPU!) While this is by no means super low-power, it is low enough and at the same time I have the capability to actually get work done instead of waiting for hours for something to compile.


As I already mentioned, I ended up rebuilding the system with SmartOS. SmartOS is not a general purpose distro. Rather, it strives to be a hypervisor with utilities that make guest management trivial. Guests can either be zones, or KVM-powered virtual machines. Here are the major changes from the OpenIndiana setup.

Storage — pools

SmartOS is one of those distros you do not install. It always netboots, or boots from a USB stick or a CD. As a result, you do not need a system drive. This immediately obsoleted the two laptop drives. Conveniently, around the same time, Holly’s laptop suffered from a disk failure so Isis got to donate one of the unused 2.5" system disks.

SmartOS calls its data pool “zones”, which took a little bit of getting used to. There’s a way to import other pools, but I wanted to keep the settings as vanilla as possible.

At some point, I threw in an Intel 160 GB SSD to use for L2ARC and ZIL.

Here’s what the pool looks like:

# zpool status
  pool: zones
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 2h59m with 0 errors on Sun Jan 13 08:37:37 2013

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
          c1t1d0s0  ONLINE       0     0     0
          c1t1d0s1  ONLINE       0     0     0

errors: No known data errors

In case you are wondering about the features related status message, I created the zones pool way back when Illumos (and therefore SmartOS) had only two ZFS features. Since then, Illumos added one and Joyent added one to SmartOS.

# zpool get all zones | /usr/xpg4/bin/grep -E '(PROP|feature)'
NAME   PROPERTY                   VALUE                      SOURCE
zones  feature@async_destroy      enabled                    local
zones  feature@empty_bpobj        active                     local
zones  feature@lz4_compress       disabled                   local
zones  feature@filesystem_limits  disabled                   local

I haven’t experimented with either enough to enable it on a production system I rely on so much.

Storage — deduplication & compression

The rebuild gave me a chance to start with a clean slate. Specifically, it gave me a chance to get rid of the dedup table. (The dedup table, DDT, is built as writes happen to the filesystem with dedup enabled.) Data deduplication relies on some form of data structure (the most trivial one is a hash table) that maps the hash of the data to the data. In ZFS, the DDT maps the SHA-256 of the block to the block address.

The reason I stopped using dedup on my systems was pretty straightforward (and not specific to ZFS). Every entry in the DDT has an overhead. So, ideally, every entry in the DDT is referenced at least twice. If a block is referenced only once, then one would be better off without the block taking up an entry in the DDT. Additionally, every time a reference is taken or released, the DDT needs to be updated. This causes very nasty random I/O under which spinning disks want to weep. It turns out that a “normal” user will have mostly unique data, rendering deduplication impractical.
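To make the overhead argument concrete, here is a toy Python model of a dedup table. It is a sketch of the general idea only, not of how ZFS lays out its DDT: the function names and the sample data are invented, and the table maps a SHA-256 digest to a reference count instead of a block address.

```python
import hashlib


def dedup_stats(blocks):
    """Store each unique block once, keyed by its SHA-256 digest.

    Returns (table, stored): table maps digest -> reference count,
    stored is the number of blocks that actually had to be kept.
    """
    table = {}
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        table[digest] = table.get(digest, 0) + 1
    return table, len(table)


def single_ref_entries(table):
    """Entries referenced only once: table overhead with nothing saved."""
    return sum(1 for refs in table.values() if refs == 1)


# Zone-like data: the same "binary" block shows up in four zones.
zones = [b"libc" * 1024] * 4 + [b"config-a", b"config-b"]
table, stored = dedup_stats(zones)
print(len(zones), stored, single_ref_entries(table))    # 6 3 2

# "Normal user" data: every block is different.
photos = [bytes([i]) * 4096 for i in range(6)]
table, stored = dedup_stats(photos)
print(len(photos), stored, single_ref_entries(table))   # 6 6 6
```

In the second case the table has six entries and saves nothing, which is the situation described above: all of the bookkeeping cost and none of the benefit.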

That’s why I stopped using dedup. Instead, I became convinced that most of the time light compression is the way to go. Lightly compressing the data will result in I/O bandwidth savings as well as capacity savings with little overhead given today’s processor speeds versus I/O latencies. Since I haven’t had time to experiment with the recently integrated LZ4, I still use LZJB.

by JeffPC at June 29, 2013 03:22 PM

June 22, 2013

Josef "Jeff" Sipek

First Solo Cross-Country

A week ago (June 15), I went on my first solo cross country flight. The plan was to fly KARB → KMBS → KAMN → KARB. In case you don’t happen to have the Detroit sectional chart in front of you, this might help you visualize the scope of the flight.

leg           distance  time
KARB → KMBS   79 nm     47 min
KMBS → KAMN   29 nm     20 min
KAMN → KARB   79 nm     46 min
Total         187 nm    113 min

Here’s the ground track (as recorded by the G1000) along with red dots for each of my checkpoints and a pink line connecting them. (Sadly, there’s no convenient zoom level that covers the entire track without excessive waste.)

ground track

As you can see, I didn’t quite overfly all the checkpoints. In my defense, the forecast winds were about 40 degrees off from reality during the first half of the flight. :)

Let’s examine each leg separately.


ground track

My checkpoint by I-69 (southwest of Flint) was supposed to be an I-69 and Pontiac VORTAC (PSI) radial 311 intersection. However, when I called up the FSS briefer, I found out that it was out of service. Thankfully, Salem VORTAC (SVM) is very close so I just used its radial 339 instead. Next time I’m using a VOR for any part of my planning, I’m going to check for any NOTAMs before I make it part of my plan — redoing portions of the plan is tedious and not fun.

On the way to Saginaw, I was planning to go at 3500. (Yes, I know, it is a westerly direction and the rule (FAR 91.159) says even thousand + 500, but the clouds were not high enough to fly at 4500 and the rule only applies 3000 AGL and above — the ground around these parts is 700-1000 feet MSL.)
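The altitude reasoning above can be written down as a small check. This Python sketch encodes the VFR cruising altitude rule as described (it only applies more than 3000 feet above the surface and below 18,000 feet MSL; easterly magnetic courses use odd thousands plus 500, westerly ones even thousands plus 500). The function names are made up, and the 330 degree course is a placeholder for "some westerly course", not the actual planned course.

```python
def vfr_rule_applies(altitude_msl, ground_msl):
    """The rule covers level cruise more than 3000 ft above the surface
    and below 18,000 ft MSL."""
    return altitude_msl - ground_msl > 3000 and altitude_msl < 18000


def altitude_matches_rule(altitude_msl, magnetic_course):
    """Courses 0-179 use odd thousands + 500, 180-359 even thousands + 500."""
    thousands, remainder = divmod(altitude_msl, 1000)
    if remainder != 500:
        return False
    wants_odd = (magnetic_course % 360) < 180
    return (thousands % 2 == 1) == wants_odd


course = 330  # hypothetical westerly magnetic course

# 3500 ft MSL over ground that is 700-1000 ft MSL is at most 2800 ft AGL,
# so the rule does not apply at all.
print(vfr_rule_applies(3500, 700))           # False
print(vfr_rule_applies(3500, 1000))          # False

# If it did apply, 3500 would be the wrong choice and 4500 the right one.
print(altitude_matches_rule(3500, course))   # False
print(altitude_matches_rule(4500, course))   # True
```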


Right when I entered the downwind for runway 23, the tower cleared me to land. My clearance was quickly followed by the tower instructing a commuter jet to hold short of 23 because of landing traffic — me! Somehow, it is very satisfying to see a real plane (CRJ-200) have to wait for little ol’ me to land. (FlightAware tells me that it was FLG3903 flight to KDTW.)

While I was on taxiway C, they got cleared to take off. I couldn’t help but snap a photo.


It was a pretty slow day for Saginaw. The whole time I was on the radio with Saginaw approach, I got to hear maybe 5 planes total. The tower was even less busy. There were no planes around except for me and the commuter jet.


This leg of the flight was the hardest. First of all, it was only 29 nm. This equated to about 25 minutes of flying. The first four-ish and the last five-ish were spent climbing and descending, so really there was about 15 minutes of cruising. Not much time to begin with. I flew this leg by following the MBS VOR radial 248. My one and only checkpoint on this leg was midway — the beginning of a wind turbine farm. It took about 2 minutes longer to get there than planned, but the wind turbines were easy to see from a distance so no problems there.

ground track

Following the VOR wasn’t difficult, but you can see in the ground track that I was meandering across it. As expected, it got easier the farther away from the station I got. Here’s the plot of the CDI deflection for this leg. The CSV file says that the units are “fsd” — I have no idea what that means.

CDI deflection

I can’t really draw any conclusions because…well, I don’t know what the graph is telling me. Sure, it seems to get closer and closer to zero (which I assume is a good thing), but I can’t honestly say that I understand what the graph is saying.

The most difficult part was trying to stay at 2500 feet. For whatever reason, it felt like I was flying in sizable thermals. Since there were no thunderstorms in the area, I flew on fighting the updrafts. That was the difficult part. I suspect the wind turbines were built there because the area is windy.


KAMN is a decent size airport. Two plenty long runways for a 172 even on a hot day (5004x75 feet and 3197x75 feet). I didn’t stop by the FBO, so I have no idea how they are. I did not notice anyone else around during the couple of minutes I spent on the ground taxiing and getting ready for the next leg. Maybe it was just the overcast that made people stay indoors. Oh well. It is a nice airport, and I wouldn’t mind stopping there in the future if the need arose.


Flying back to Ann Arbor was the easy part of the trip. The air calmed down enough that once trimmed, the plane more or less stayed at 3500 feet.


It apparently was a slow day for Lansing approach as well, as I got to hear a controller chatting with a pilot of a skydiving plane about how fast the skydivers fell to the ground. Sadly, I didn’t get to hear the end of the conversation since the controller told me to contact Detroit approach.

As far as the ground track is concerned, you can see two places where I stopped flying the planned heading and instead flew toward the next checkpoint visually. The first instance is a few miles north of KOZW. I spotted the airport, and since I knew I was supposed to overfly it, I turned to it and flew right over it. The second instance is by Whitmore Lake — there I looked into the distance and saw Ann Arbor. Knowing that the airport is on the south side, I just headed right toward it ignoring the planned heading. As I mentioned before, in both cases the planned course was slightly off because the winds weren’t quite like the forecast said they would be.

ground track

You can’t tell from the rather low resolution of the map, but I got to fly right over Michigan Stadium. Sadly, I was a bit too busy flying the plane to take a photo of the field below me.


With one solo cross country out of the way, I’m still trying to figure out where I want to go next. Currently, I am considering one of these flights (in no particular order):

path                  distance  time
KARB KGRR KMOP KARB   239 nm    2h19m
KARB KBIV KJXN KARB   220 nm    2h08m
KARB KFWA KTOL KARB   210 nm    2h03m
KARB KMBS KGRR KARB   243 nm    2h21m
KARB KGRR KEKM KARB   266 nm    2h40m

by JeffPC at June 22, 2013 05:23 PM

Benchmark Assumptions

Today I came across a blog post about Running PostgreSQL on Compression-enabled ZFS. I found the article because (1) I am a fan of ZFS, and (2) transparent storage compression interests me. (Maybe I’ll talk about the latter in the future.)

Whoever ran the benchmark decided to compare ZFS with lzjb and ZFS with gzip against ext3. Their analysis states that ZFS-gzip is faster than ZFS-lzjb, which is faster than ext3. They admit that the benchmark is I/O bound. Then they state that compression effectively speeds up the disk I/O by making every byte transferred contain more information. The analysis goes down the drain right after that.

While doing background research for this blog post we also got a chance to investigate some of the other features besides compression that differentiate ZFS from older file system architectures like ext3. One of the biggest differences is ZFS’s approach to scheduling disk IOs which employs explicit IO priorities, IOP reordering, and deadline scheduling in order to avoid flooding the request queues of disk controllers with pending requests.

Anyone who’s benchmarked a system should have a red flag going off after reading those sentences. My reaction was something along the lines: “What?! You know that there are at least three major differences between ZFS and ext3 in addition to compression and you still try to draw conclusions about compression effectiveness by comparing ZFS with compression against ext3?!”

All they had to do to make their analysis so much more interesting and keep me quiet was to include another set of numbers — ZFS without compression. That way, one can compare ext3 with ZFS-uncompressed to see how much difference the radically different filesystem design makes. Then one could compare ZFS-uncompressed with the lzjb and gzip data to see if compression helps. Based on the data presented, we have no idea if compression helps — we just know that compression and ZFS outperform ext3. What if ZFS without compression is 5x faster than ext3? Then using gzip (~4x faster than ext3) is actually not the fastest.
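The missing baseline is easy to illustrate with a bit of arithmetic. In this Python sketch all numbers are hypothetical and normalized so that ext3 is 1.0: the 4x figure is the rough gzip result mentioned above, and the 5x figure is the "what if" case, not a measurement. The function name is made up as well.

```python
def compression_effect(ext3, zfs_plain, zfs_compressed):
    """Split a speedup over ext3 into a filesystem part and a compression part.

    All arguments are throughputs in the same unit (higher is better).
    """
    filesystem_factor = zfs_plain / ext3
    compression_factor = zfs_compressed / zfs_plain
    return filesystem_factor, compression_factor


ext3 = 1.0
zfs_gzip = 4.0    # "~4x faster than ext3"
zfs_plain = 5.0   # hypothetical: uncompressed ZFS turns out to be 5x faster

fs, comp = compression_effect(ext3, zfs_plain, zfs_gzip)
print(fs, comp)   # 5.0 0.8
```

In that scenario the filesystem design accounts for all of the win and gzip costs 20%. Without the uncompressed ZFS number there is no way to tell this case apart from one where compression really does help.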

To be fair, knowing how modern disk drives behave, chances are that compressed ZFS is faster than uncompressed ZFS. Since CPU cycles are so plentiful these days, all my systems have lzjb compression enabled everywhere. I do this mostly to conserve space, but also in hopes of transferring less data to disk. Yes, this is exactly what their benchmark attempts to show. (I haven’t had a chance to experiment with the new-ish lz4 compression algorithm in ZFS.) My point here is solely about benchmark analysis and unfounded (or at least unstated) assumptions found in just about every benchmark out there.

by JeffPC at June 22, 2013 02:15 AM

June 09, 2013

Josef "Jeff" Sipek

Plotting G1000 EGT

It would seem that my two recent posts are getting noticed. On one of them, someone asked for the EGT R code I used.

After I get the CSV file off the SD card, I first clean it up. Currently, I just do it manually using Vim, but in the future I will probably script it. It turns out that Garmin decided to put a header of sorts at the beginning of each CSV. The header includes version and part numbers. I delete it. The next line appears to have units for each of the columns. I delete it as well. The remainder of the file is an almost normal CSV. I say almost normal, because there’s an inordinate number of spaces around the values and commas. I use the power of Vim to remove all the spaces in the whole file by using :%s/ //g. Then I save and quit.
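Scripting that cleanup could look roughly like the following Python sketch. It assumes the header occupies exactly one line and the units line follows it directly, so the third line holds the column names. The function name and the sample contents are made up, only mimicking the shape described above.

```python
import io


def clean_g1000_csv(raw_lines):
    """Drop the header and units lines, then strip every space.

    Assumes one header line followed by one units line; everything
    after that is kept, with all spaces removed (like :%s/ //g).
    """
    kept = list(raw_lines)[2:]
    return [line.replace(" ", "") for line in kept]


# Made-up sample standing in for a real log file.
sample = io.StringIO(
    "#header, version=1.00, part=000-00000-00\n"
    "#yyy-mm-dd, hh:mm:ss,   ft\n"
    "  Lcl Date, Lcl Time, AltB\n"
    "2013-06-01, 14:24:54,  839.0\n"
)

cleaned = clean_g1000_csv(sample)
print("".join(cleaned), end="")
```

With a real log, the StringIO object would be replaced by an open file, and the cleaned lines written out to something like munged.csv for R to read.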

Now that I have a pretty standard looking CSV, I let R do its thing.

> data <- read.csv("munged.csv")
> names(data)
 [1] "LclDate"   "LclTime"   "UTCOfst"   "AtvWpt"    "Latitude"  "Longitude"
 [7] "AltB"      "BaroA"     "AltMSL"    "OAT"       "IAS"       "GndSpd"   
[13] "VSpd"      "Pitch"     "Roll"      "LatAc"     "NormAc"    "HDG"      
[19] "TRK"       "volt1"     "volt2"     "amp1"      "amp2"      "FQtyL"    
[25] "FQtyR"     "E1FFlow"   "E1OilT"    "E1OilP"    "E1RPM"     "E1CHT1"   
[31] "E1CHT2"    "E1CHT3"    "E1CHT4"    "E1EGT1"    "E1EGT2"    "E1EGT3"   
[37] "E1EGT4"    "AltGPS"    "TAS"       "HSIS"      "CRS"       "NAV1"     
[43] "NAV2"      "COM1"      "COM2"      "HCDI"      "VCDI"      "WndSpd"   
[49] "WndDr"     "WptDst"    "WptBrg"    "MagVar"    "AfcsOn"    "RollM"    
[55] "PitchM"    "RollC"     "PichC"     "VSpdG"     "GPSfix"    "HAL"      
[61] "VAL"       "HPLwas"    "HPLfd"     "VPLwas"   

As you can see, there are lots of columns. Before doing any plotting, I like to convert the LclDate, LclTime, and UTCOfst columns into a single Time column. I also get rid of the three individual columns.

> data$Time <- as.POSIXct(paste(data$LclDate, data$LclTime, data$UTCOfst))
> data$LclDate <- NULL
> data$LclTime <- NULL
> data$UTCOfst <- NULL

Now, let’s focus on the EGT values — E1EGT1 through E1EGT4. E1 refers to the first engine (the 172 has only one); I suspect that a G1000 on a twin would have E1 and E2 values. I use the ggplot2 R package to do my graphing. I could pick colors for each of the four EGT lines, but I’m way too lazy and the color selection would not look anywhere near as nice as it should. (Note: if you have only two values to plot, R will use a red-ish and a blue-ish/green-ish color for the lines. Not exactly the smartest selection if your audience may include someone color-blind.) So, instead I let R do the hard work for me. First, I make a new data.frame that contains the time and the EGT values.

> tmp <- data.frame(Time=data$Time, C1=data$E1EGT1, C2=data$E1EGT2,
                    C3=data$E1EGT3, C4=data$E1EGT4)
> head(tmp)
                 Time      C1      C2      C3      C4
1 2013-06-01 14:24:54 1029.81 1016.49 1019.08 1098.67
2 2013-06-01 14:24:54 1029.81 1016.49 1019.08 1098.67
3 2013-06-01 14:24:55 1030.94 1017.57 1019.88 1095.38
4 2013-06-01 14:24:56 1031.92 1019.05 1022.81 1095.84
5 2013-06-01 14:24:57 1033.16 1020.23 1022.82 1092.38
6 2013-06-01 14:24:58 1034.54 1022.33 1023.72 1085.82

Then I use the reshape2 package to reorganize the data.

> library(reshape2)
> tmp <- melt(tmp, "Time", variable.name="Cylinder")
> head(tmp)
                 Time Cylinder   value
1 2013-06-01 14:24:54       C1 1029.81
2 2013-06-01 14:24:54       C1 1029.81
3 2013-06-01 14:24:55       C1 1030.94
4 2013-06-01 14:24:56       C1 1031.92
5 2013-06-01 14:24:57       C1 1033.16
6 2013-06-01 14:24:58       C1 1034.54

The melt function takes a data.frame along with a name of a column (I specified “Time”), and reshapes the data.frame. For each row in the original data.frame, it takes all the columns not specified (e.g., not Time) and produces a row for each, with the variable name being the column name and the value being that column’s value in the original row. Here’s a small example:

> df <- data.frame(x=c(1,2,3),y=c(4,5,6),z=c(7,8,9))
> df
  x y z
1 1 4 7
2 2 5 8
3 3 6 9
> melt(df, "x")
  x variable value
1 1        y     4
2 2        y     5
3 3        y     6
4 1        z     7
5 2        z     8
6 3        z     9

As you can see, the x values got duplicated since there were two other columns. Anyway, the one difference in my call to melt is the additional argument, which names the variable column. I don’t want my variable name column to be called “variable”; I want it to be called “Cylinder.”
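To make the reshaping concrete outside of R, here is a minimal Python sketch of the same wide-to-long idea. It is not the reshape2 implementation; the function melt and its variable_name parameter are illustrative names chosen to mirror the R call, and the input is the same small table as the df example above.

```python
def melt(rows, id_col, variable_name="variable"):
    """Reshape wide rows (a list of dicts) into long form.

    One output row is produced per input row and per non-id column.
    """
    if not rows:
        return []
    # Every column except the id column is treated as a measured variable.
    measured = [col for col in rows[0] if col != id_col]
    long_rows = []
    # Go column by column so the output order matches the R example.
    for col in measured:
        for row in rows:
            long_rows.append(
                {id_col: row[id_col], variable_name: col, "value": row[col]}
            )
    return long_rows


# The same small table as the df example above.
wide = [
    {"x": 1, "y": 4, "z": 7},
    {"x": 2, "y": 5, "z": 8},
    {"x": 3, "y": 6, "z": 9},
]

for long_row in melt(wide, "x", variable_name="Cylinder"):
    print(long_row)
```

Each x value shows up once per measured column, which is the duplication seen in the R output.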

At this point, the data is ready to be plotted.

> library(ggplot2)
> p <- ggplot(tmp)
> p <- p + ggtitle("Exhaust Gas Temperature")
> p <- p + ylab(expression(Temperature~(degree*F)))
> p <- p + geom_line(aes(x=Time, y=value, color=Cylinder))
> print(p)

That’s all there is to it! There may be a better way to do it, but this works for me. I use the same approach to plot the different altitude numbers, the speeds (TAS, IAS, GS), CHT, and fuel quantity.

You can download an R script with the above code here.

by JeffPC at June 09, 2013 07:38 PM

June 02, 2013

Josef "Jeff" Sipek

Garmin G1000 Data Logging: Cross-Country Edition

About a week ago, I talked about G1000 data logging. In that post, I mentioned that cross-country flying would be interesting to visualize. Well, on Friday I got to do a mock pre-solo cross country phase check. I had the G1000 logging the trip.

First of all, the plan was to fly from KARB to KFPK. It’s a 51nm trip. I had four checkpoints. For the purposes of plotting the flight, I had to convert the pencil marks on my sectional chart to latitude and longitude.

> xc_checkpoints
          Name Latitude Longitude
1      Chelsea 42.31667 -84.01667
2       Munith 42.37500 -84.20833
3       Leslie 42.45000 -84.43333
4 Eaton Rapids 42.51667 -84.65833

First of all, let’s take a look at the ground track.

ground track

In addition to just the ground track, I plotted here the first three checkpoints in red, the location of the plane every 5 minutes in blue (excluding all the data points near the airport), and some other places of interest in green.

As you can see, I was always a bit north of where I was supposed to be. Right after passing Leslie, I was told to divert to 69G. I figured out the true course, and tried to take the wind into account, but as you can see it didn’t go all that well at first. When I found myself next to some oil tanks way north of where I wanted to be, I turned southeast…a little bit too much. Eventually, I made it to Richmond which was, much like all grass fields, way too hard to spot. (I’m pretty sure that I will avoid all grass fields while on my solo cross countries.)

So, how about the altitude? The plan was to fly at 4500 feet, but due to clouds being at about 3500, pilotage being the purpose of this exercise, and not planning on going all the way to KFPK anyway, we just decided to stay at 3000. At one point, 3000 seemed like a bit too close to the clouds, so I ended up at 2900. Below is the altitude graph. For your convenience, I plotted horizontal lines at 2800, 2900, 3000, and 3100 feet. (Near the end, you can see 4 touch and gos and a full stop at KARB.)


While approaching my second checkpoint, Munith, I realized that it would be pretty hard to find. It’s a tiny little town, but sadly it is the biggest “landmark” around. So, I tuned in the JXN VOR and estimated that the 50 degree radial would go through Munith. While that wouldn’t give me my location, it would tell me when I was abeam Munith. Shortly after, I changed my estimate to the 60 degree radial. (It looks like 65 is the right answer.)

> summary(factor(data$NAV1))
109.6 114.3 
 3192  1406 
> summary(factor(data$CRS))
  36   37   42   44   47   48   49   50   52   57   59   60 
1444    1    1    1    1    1    1  135    1    1    1 3010 
> head(subset(data, HSIS=="NAV1")$Time, 1)
[1] "2013-05-31 09:43:23 EDT"
> head(subset(data, NAV1==109.6)$Time, 1)
[1] "2013-05-31 09:43:42 EDT"
> head(subset(data, CRS==50)$Time, 1)
[1] "2013-05-31 09:44:26 EDT"
> head(subset(data, CRS==60)$Time, 1)
[1] "2013-05-31 09:46:48 EDT"

When I got the plane, the NAV1 radio was tuned to 114.3 (SVM) with the 36 degree radial set. At 9:43:23, I switched the input for the HSI from GPS to NAV1; at 9:43:42, I tuned into 109.6 (JXN). 44 seconds later, I had the 50 degree radial set. Over two minutes later, I changed my mind and set the 60 degree radial, which stayed there for the remainder of the flight.
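The "first row where a column has a given value" lookups above can be sketched in Python as well. This is only an illustration: the sample rows are simplified, the four timestamps come from the R output above, and the other values in each row (for example, the course still reading 36 when JXN was tuned) are assumptions. The helper names parse and first_time are hypothetical.

```python
from datetime import datetime


def parse(ts):
    """Parse a timestamp in the format printed in the log output above."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


# Simplified sample rows: only the change times are taken from the post,
# the remaining column values are assumed for illustration.
rows = [
    {"Time": parse("2013-05-31 09:43:23"), "HSIS": "NAV1", "NAV1": 114.3, "CRS": 36},
    {"Time": parse("2013-05-31 09:43:42"), "HSIS": "NAV1", "NAV1": 109.6, "CRS": 36},
    {"Time": parse("2013-05-31 09:44:26"), "HSIS": "NAV1", "NAV1": 109.6, "CRS": 50},
    {"Time": parse("2013-05-31 09:46:48"), "HSIS": "NAV1", "NAV1": 109.6, "CRS": 60},
]


def first_time(rows, column, value):
    """Return the timestamp of the first row whose column equals value, or None."""
    for row in rows:
        if row[column] == value:
            return row["Time"]
    return None


tuned = first_time(rows, "NAV1", 109.6)
radial_50 = first_time(rows, "CRS", 50)
radial_60 = first_time(rows, "CRS", 60)

# Elapsed time between the events, in seconds.
print("tuned JXN -> 50 degree radial:", (radial_50 - tuned).total_seconds())
print("50 -> 60 degree radial:", (radial_60 - radial_50).total_seconds())
```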

In my previous post about the G1000 data logging abilities, I mentioned that the engine related variables would be more interesting on a cross-country. Let’s take a look.

engine RPM

As you can see, when reaching 3000 feet (cf. the altitude graph) I pulled the power back to a cruise setting. Then I started leaning the mixture.

fuel flow

Interestingly, just pulling the power back causes a large saving of fuel. Leaning helped save about one gallon/hour. While that’s not bad (~11%), it is not as significant as I thought it would be.


Since there was nowhere near as much maneuvering as previously, the fuel quantity graphs look way more useful. Again, we can see that the left tank is being used more.

The cylinder head temperature and exhaust gas temperature graphs are mostly boring. Unlike the previous graphs of CHT and EGT these clearly show a nice 30 minute long period of cruising. To be honest, I thought these graphs would be more interesting. I’ll probably keep plotting them in the future but not share them unless they show something interesting.

cylinder head temperature exhaust gas temperature

Same goes for the oil pressure and temperature graphs. They are kind of dull.

oil pressure oil temperature

Anyway, that’s it for today. Hopefully, next time I’ll try to look at how close the plan was to reality.

by JeffPC at June 02, 2013 07:52 PM

May 26, 2013

Josef "Jeff" Sipek

Garmin G1000 Data Logging

About a month ago I talked about using R for plotting GPS coordinates. Recently I found out that the Cessna 172 I fly in has had its G1000 avionics updated. Garmin has added the ability to store various flight data to a CSV file on an SD card every second. Aside from the obvious things such as date, time and GPS latitude/longitude/altitude, it stores a ton of other variables. Here is a subset: indicated airspeed, vertical speed, outside air temperature, pitch attitude angle, roll attitude angle, lateral and vertical G forces, the NAV and COM frequencies tuned, wind direction and speed, fuel quantity (for each tank), fuel flow, volts and amps for the two buses, engine RPM, cylinder head temperature, and exhaust gas temperature. Neat, eh? I went for a short flight that was pretty boring as far as a number of these variables are concerned. Logs for cross-country flights will be much more interesting to examine.

With that said, I’m going to have fun with the 1-hour recording I have. If you don’t find plotting time series data interesting, you might want to stop reading now. :)

First of all, let’s take a look at the COM1 and COM2 radio settings.

> unique(data$COM1)
[1] 120.3
> unique(data$COM2)
[1] 134.55 120.30 121.60

Looks like I had 3 unique frequencies tuned into COM2 and only one for COM1. I always try to get the ATIS on COM2 (134.55 at KARB), then I switch to the ground frequency (121.6 at KARB). This way, I know that COM2 both receives and transmits. Let’s see how long I was on the ATIS frequency…

> summary(factor(data$COM2))
 120.3  121.6 134.55 
     1   3303     70 

It makes sense: between listening to the ATIS and tuning in the ground, I spent 70 seconds on 134.55. The tower frequency (120.3 at KARB) showed up for a second because I switched away from the ATIS only to realize that I hadn’t tuned in the ground yet. Graphing these values doesn’t make sense.
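Since the G1000 writes one row per second, counting rows per frequency is the same as counting seconds. A minimal Python sketch of that tally follows; the list is rebuilt from the counts in the summary output above (it is not the actual log), and the ordering of the values is an assumption based on the description.

```python
from collections import Counter

# Rebuilt from the summary counts above: ATIS, then tower for one second,
# then ground for the rest of the recording. One entry per logged second.
com2 = [134.55] * 70 + [120.3] * 1 + [121.6] * 3303

seconds_on = Counter(com2)

for freq, seconds in sorted(seconds_on.items()):
    print(f"{freq:7.2f} MHz: {seconds} s")
```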

I didn’t use the NAV radios, so they stayed tuned to 114.3 and 109.6. Those are the Salem and Jackson VORs, respectively. (Whoever used the NAV radios last left these tuned in.)

To keep track of one’s altitude, one must set the altimeter to what a nearby weather station says. The setting is in inches of mercury. The ATIS said that 30.38 was the setting to use. The altimeter was set to 30.31 when I got it. You can see that it took me a couple of seconds to turn the knob far enough. Again, graphing this variable is pointless. It would be more interesting during a longer flight where the barometric pressure changed a bit.

> summary(factor(data$BaroA))
30.31 30.32 30.36 30.38 
  262     1     1  3110 

Ok, ok… time to make some graphs… First up, let’s take a look at the outside air temperature (in °C).

> summary(data$OAT)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0     6.8    12.2    11.5    16.0    18.5 


In case you didn’t know, the air temperature drops about 2°C every 1000 feet. Given that, you might already be guessing that, after I took off, I climbed a couple of thousand feet.


Here, I plotted both the altitude given by the GPS (MSL as well as WGS84) and the altitude given by the altimeter. You can see that around 12:12, I set the altimeter, which caused the indicated altitude to jump up a little bit. Let’s take a look at the difference between them.

altitude difference

Again, we can see the altimeter setting changing with the sharp ~60 foot jump at about 12:12. The discrepancy between the indicated altitude and the actual (GPS) altitude may be alarming at first, but keep in mind that even though the altimeter may be off from where you truly are, the whole air traffic system plays the same game. In other words, every aircraft and every controller uses the altimeter-based altitudes so there is no confusion. In yet other words, if everyone is off by the same amount, no one gets hurt. :)

Ok! It’s time to look at all the various speeds. The G1000 reports indicated airspeed (IAS), true airspeed (TAS), and ground speed (GS).


We can see the taxiing to and from the runway, with ground speed around 10 kts. (Note to self, taxi slower.) The ground speed is either more or less than the airspeed depending on the wind.

Moving along, let’s examine the lateral and normal accelerations. The normal acceleration is the seat pushing “up”, while the lateral acceleration is the side-to-side “sliding in the seat” acceleration. (Note: I am not actually sure which way the G1000 considers negative lateral acceleration.)


Ideally, there is no lateral acceleration. (See coordinated flight.) I’m still learning. :)

As you can see, there are several outliers. So, why not look at them! Let’s consider an outlier any point with more than 0.1 G of lateral acceleration. (I chose this value arbitrarily.)

> nrow(subset(data, abs(LatAc) > 0.1))
[1] 41
> nrow(subset(data, abs(LatAc) > 0.1 & AltB < 2000))
[1] 28

As far as lateral acceleration goes, there were only 41 points beyond 0.1 Gs, 28 of which were below 2000 feet. (KARB’s pattern altitude is 1800 feet, so 2000 should be enough to easily cover any deviation.) Both of these counts, however, include all the taxiing. A turn during a taxi will result in a lateral acceleration, so let’s ignore all the points when we’re going below 25 kts.

> nrow(subset(data, abs(LatAc) > 0.1 & GndSpd > 25))
[1] 26
> nrow(subset(data, abs(LatAc) > 0.1 & AltB < 2000 & GndSpd > 25))
[1] 13

Much better! Only 26 points total, 13 below 2000 feet. Where did these points happen? (Excuse the low resolution of the map.) You can also see the path I flew: taking off from runway 6, then making a left turn to fly west to the practice area.


The moment I took off, I noticed that the thermals were not going to make this a nice smooth ride. I think that’s why there are at least three points right by the highway while I was still climbing out of KARB. The air did get smoother higher up, but it still wasn’t a nice calm flight like the ones I’ve gotten used to during the winter. Looking at the map, I wonder if some of these points were due to abrupt power changes.

Here’s a close-up on the airport. This time, the point color indicates the amount of acceleration.


There are only 4 points displayed. Interestingly, three of the four points are negative. Let’s take a look.

                    Time LatAc  AltB  E1RPM
2594 2013-05-25 12:52:10 -0.11 879.6 2481.1
2846 2013-05-25 12:56:31 -0.13 831.6  895.8
2847 2013-05-25 12:56:32  0.18 831.6  927.4
2865 2013-05-25 12:56:50 -0.13 955.6 2541.5

The middle two are a second apart. Based on the altitude, it looks like the plane was on the ground. Based on the engine RPMs, it looks like it was within a second or two of touchdown. Chances are that it was just the nose not quite aligned with the direction of travel. The other two points are likely thermals tossing the plane about a bit: the first point is from about 50 feet above ground, the last is from about 120 feet. Ok, I’m curious…

> data[c(2835:2850),c("Time","LatAc","AltB","E1RPM","GndSpd")]
                    Time LatAc  AltB  E1RPM GndSpd
2835 2013-05-25 12:56:20 -0.02 876.6 1427.9  66.71
2836 2013-05-25 12:56:21  0.01 873.6 1077.1  65.71
2837 2013-05-25 12:56:22  0.01 864.6  982.4  64.21
2838 2013-05-25 12:56:23  0.04 861.6  994.1  62.77
2839 2013-05-25 12:56:24  0.01 858.6  982.6  61.54
2840 2013-05-25 12:56:25  0.01 852.6  988.2  60.18
2841 2013-05-25 12:56:26 -0.02 845.6  959.0  58.91
2842 2013-05-25 12:56:27  0.00 846.6  945.5  57.73
2843 2013-05-25 12:56:28  0.01 844.6  930.9  56.53
2844 2013-05-25 12:56:29  0.10 834.6  908.0  55.16
2845 2013-05-25 12:56:30 -0.01 827.6  886.6  54.16
2846 2013-05-25 12:56:31 -0.13 831.6  895.8  52.71
2847 2013-05-25 12:56:32  0.18 831.6  927.4  51.49
2848 2013-05-25 12:56:33 -0.06 831.6  982.0  50.21
2849 2013-05-25 12:56:34  0.05 840.6 1494.0  49.39
2850 2013-05-25 12:56:35 -0.07 833.6 2249.7  48.76

The altitudes look a little out of whack, but otherwise it makes sense. #2835 was probably the time the throttle was pulled to idle. Between #2848 and #2849 the throttle went full in. The ground was most likely around 832 feet and touchdown was likely at #2846, as I guessed earlier.
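The outlier filter used throughout this section (more than 0.1 G of lateral acceleration while moving faster than 25 kts) can be sketched in Python against the sixteen rows printed above. The values are copied from that table; the function name and its parameters are hypothetical, and this is a sketch of the filtering idea rather than a port of the R session.

```python
# (row number, LatAc in G, AltB in feet, GndSpd in kts), copied from the table above.
rows = [
    (2835, -0.02, 876.6, 66.71),
    (2836, 0.01, 873.6, 65.71),
    (2837, 0.01, 864.6, 64.21),
    (2838, 0.04, 861.6, 62.77),
    (2839, 0.01, 858.6, 61.54),
    (2840, 0.01, 852.6, 60.18),
    (2841, -0.02, 845.6, 58.91),
    (2842, 0.00, 846.6, 57.73),
    (2843, 0.01, 844.6, 56.53),
    (2844, 0.10, 834.6, 55.16),
    (2845, -0.01, 827.6, 54.16),
    (2846, -0.13, 831.6, 52.71),
    (2847, 0.18, 831.6, 51.49),
    (2848, -0.06, 831.6, 50.21),
    (2849, 0.05, 840.6, 49.39),
    (2850, -0.07, 833.6, 48.76),
]


def lateral_outliers(rows, max_g=0.1, min_speed=25.0, max_alt=None):
    """Return the row numbers whose lateral acceleration exceeds max_g.

    Rows at or below min_speed are skipped to ignore taxiing; if max_alt is
    given, rows at or above that altitude are skipped as well.
    """
    hits = []
    for number, lat_ac, alt, speed in rows:
        if abs(lat_ac) <= max_g or speed <= min_speed:
            continue
        if max_alt is not None and alt >= max_alt:
            continue
        hits.append(number)
    return hits


print(lateral_outliers(rows))
print(lateral_outliers(rows, max_alt=2000))
```

Note that #2844 sits exactly at 0.10 G, so a strict "greater than" comparison leaves it out, just as the R subset does.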

Let’s plot the engine related values. First up, engine RPMs.


It is pretty boring. You can see the ~800 during taxi; the 1800 during the runup; the 2500 during takeoff; 2200 during cruise; and after 12:50 you can see the go-around, touch-n-go, and full stop.

Next up, cylinder head temperature (in °F) and exhaust gas temperature (also in °F). Since the plane has a 4 cylinder engine, there are four lines on each graph. As I was maneuvering most of the time, I did not get a chance to try to lean the engine. On a cross country, it would be pretty interesting to see the temperature go up as a result of leaning.



Moving on, let’s look at fuel consumption.

fuel quantity

This is really weird. For the longest time, I knew that the plane used more fuel from the left tank, but this is the first time I have solid evidence. (Yes, the fuel selector was on “Both”.) The fuel flow graph is rather boring — it very closely resembles the RPM graph.

fuel flow

Ok, two more engine related plots.

oil temperature

oil pressure

It is mildly interesting that the temperature never really goes down while the pressure seems to be correlated with the RPMs.

There are two variables with the vertical speed — one is GPS based while the other is barometer based.

vertical speed

As you can see, the two appear to be very similar. Let’s take a look at the delta. In addition to just a plain old subtraction, you can see the 60-second moving average.

vertical speed: GPS vs. Barometer

Not very interesting. Even though the two sometimes are off by as much as 560 feet/minute, the differences are very short-lived. Furthermore, the differences are pretty well distributed, with half of them being within about 50 feet/minute.

> summary(data$VSpd - data$VSpdG)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-559.8000  -49.2800    0.4950    0.8252   53.0600  563.4000 
> summary(SMA(data$VSpd - data$VSpdG),2)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
-240.2000  -22.2200    0.6940    0.8226   25.4700  226.7000         9 
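For readers unfamiliar with moving averages, here is a minimal Python sketch of a trailing simple moving average. It assumes the SMA call above behaves like a plain trailing mean over a fixed window; a window of 10 samples is used as the default because that would account for the 9 NA's in the second summary, but that is an inference, not something the post states. The function name is illustrative.

```python
def simple_moving_average(values, n=10):
    """Trailing mean over the last n samples.

    The first n - 1 positions have no full window yet and are reported as None
    (the analogue of NA in the R output).
    """
    out = []
    window_sum = 0.0
    for i, value in enumerate(values):
        window_sum += value
        if i >= n:
            # Drop the sample that just slid out of the window.
            window_sum -= values[i - n]
        out.append(window_sum / n if i >= n - 1 else None)
    return out


# Tiny usage example with a 3-sample window.
print(simple_moving_average([1, 2, 3, 4, 5], n=3))
```

Averaging over a window damps the short-lived spikes, which is why the smoothed minimum and maximum are so much smaller than the raw ones.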

Ok, last but not least the CSV contains the pitch and roll angles. I’ll have to think about what sort of creative analysis I can do. The only thing that jumps to mind is the mediocre S-turn around 12:40 where the roll changed from about 20 degrees to -25 degrees.



I completely ignored the volts and amps variables (for each of the two busses), all the navigation related variables (waypoint identifier, bearing, and distance, HSI source, course, CDI/GS deflection), wind (direction and speed), as well as ground track, magnetic heading and variation, GPS fix (it was always 3D), GPS horizontal/vertical alert limit, and WAAS GPS horizontal/vertical protection level (I don’t think the avionics can handle WAAS; the columns were always empty). Additionally, since I wasn’t using the autopilot, a number of the fields are blank (Autopilot On/Off, mode, commands).


A while ago I learned about CloudAhoy. Their iPhone/iPad app uses the GPS to record your flight. Then, they do some number crunching to figure out what kind of maneuvers you were doing. (I contacted them a while ago to see if one could upload a GPS trace instead of using their app; sadly, it was not possible. I do not know if that has changed since.) I think it’d be kind of cool to write an (R?) script that’d take the G1000 recording and do similar analysis. The big difference is the ability to use the great number of other variables to evaluate the pilot’s control of the airplane, ranging from coordinated flight and dangerous maneuvers (banking too aggressively while slow), to “did you forget to lean?”.

by JeffPC at May 26, 2013 07:56 PM

May 19, 2013

Justin Dearing

Creating a minimally viable Centos instance for SSH X11 Forwarding

I recently needed to set up a CentOS 6.4 VM for Java development. I wanted to be able to run Eclipse STS on said VM and display the X11 windows remotely on my Windows 7 desktop via Xming. I saw no reason for the CentOS VM to have a local X11 server, and I’m quite comfortable with the Linux command line. I decided to briefly share how to go from a CentOS minimal install to something actually useful for getting work done.

  • /usr/bin/man The minimal install installs man pages, but not the man command. This is an odd choice. yum install man will fix that.
  • vim There is a bare bones install of vim included by default that is only accessible via vi. If you want a more robust version of vim, yum install vim.
  • X11 forwarding You need the xauth package and fonts. yum install xauth will allow X11 forwarding to work. yum groupinstall fonts will install a set of fonts.
  • A terminal for absolute minimal viability yum install xterm will give you a terminal. I prefer terminator, which is available through rpmforge.
  • RpmForge (now repoforge) CentOS is based on Red Hat Enterprise Linux. Therefore it focuses on being a good production server, not a developer environment. You will probably need rpmforge to get some of the packages you want. The directions for adding Rpmforge to your yum repositories are here.
  • terminator This is my terminal emulator of choice. Once you’ve added rpmforge, yum install terminator.
  • gcc, glibc, etc Honestly, you can usually live without these if you stick to precompiled rpms, and you’re not using gcc for development. If you need to build a kernel module, yum install kernel-devel gcc make should get you what you need.

From here, you can install the stuff you need for your development environment for your language, framework, and scm of choice.

by Justin at May 19, 2013 03:35 PM

May 11, 2013

Justin Dearing

When your PowerShell cmdlet doesn’t return anything, use -PassThru

The other day I was mounting an ISO in Windows 8 via the Mount-DiskImage command. Since I was mounting the disk image in a script, I needed to know the drive letter it was mounted to so the script could access the files contained within. However, Mount-DiskImage was not returning anything. I didn’t want to go through the hack of listing drives before and after I mounted the disk image, or explicitly assigning the drive letter. Both would leave me open to race conditions if another drive was mounted by another process while my script ran. I was at a loss for what to do.

Then, I remembered the -PassThru parameter, which I am quite fond of using with Add-Type. See, certain cmdlets, like Mount-DiskImage and Add-Type, don’t return pipeline output by default. For Add-Type, this makes sense. You rarely want to see a list of the types you just added, unless you’re exploring the classes in a DLL from the command line. However, for Mount-DiskImage, defaulting to no output was a questionable decision IMHO.

Now in the case of Mount-DiskImage, -PassThru doesn’t return the drive letter. However, it does return an object that you can pipe to Get-Volume which does return an object with a DriveLetter property. To figure that out, I had to ask on stackoverflow.

tl;dr: If your PowerShell cmdlet doesn’t return any output, try -PassThru. If you need the drive letter of a disk image mounted with Mount-DiskImage, pipe the output through Get-Volume.

For a more in-depth treatise on -PassThru, check out this script guy article by Ed Wilson (blog|twitter).

by Justin at May 11, 2013 12:50 AM

Getting the Drive Letter of a disk image mounted with WinCdEmu

In my last post, I talked about mounting disk images in Windows 8. Both Windows 8 and Windows Server 2012 include native support for mounting ISO images as drives. However, in prior versions of Windows you needed a third party tool to do this. Since I have a preference for open source, my tool of choice before Windows 8 was WinCdEmu. Today, I decided to see if it was possible to determine the drive letter of an ISO mounted by WinCdEmu with PowerShell.

A quick search of the internet revealed that WinCdEmu contained a 32 bit command line tool called batchmnt.exe, and a 64 bit counterpart called batchmnt64.exe. These tools were meant for command line automation. While I knew there would be no .NET libraries in WinCdEmu, I did have hope there would be a COM object I could use with New-Object. Unfortunately, all the COM objects were for Windows Explorer integration and popped up GUIs, so they were inappropriate for automation.

Next I needed to figure out how to use batchmnt. For this I used batchmnt64 /?.

C:\Users\Justin>"C:\Program Files (x86)\WinCDEmu\batchmnt64.exe" /?
BATCHMNT.EXE - WinCDEmu batch mounter.
batchmnt <image file> [<drive letter>] [/wait] - mount image file
batchmnt /unmount <image file>         - unmount image file
batchmnt /unmount <drive letter>:      - unmount image file
batchmnt /check   <image file>         - return drive letter as ERORLEVEL
batchmnt /unmountall                   - unmount all images
batchmnt /list                         - list mounted


Mounting and unmounting are trivial. The /list switch produces some output that I could parse into a PSObject if I so desired. However, what I really found interesting was batchmnt /check. The process returned the drive letter as ERORLEVEL. That means the ExitCode of the batchmnt process. If you ever programmed in a C-like language, you know your main function can return an integer. Traditionally, 0 means success and a nonzero number means failure. However, in this case 0 means the image is not mounted, and a nonzero number is the ASCII code of the drive letter. Getting that code in PowerShell is simple:

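# -PassThru returns the process object; -Wait blocks until the process exits.
$proc = Start-Process  -Wait `
    "C:\Program Files (x86)\WinCDEmu\batchmnt64.exe" `
    -ArgumentList '/check', '"C:\Users\Justin\SQL Server Media\2008R2\en_sql_server_2008_r2_developer_x86_x64_ia64_dvd_522665.iso"' `
    -PassThru
# The exit code is the ASCII code of the drive letter (0 if not mounted).
[char] $proc.ExitCode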

The Start-Process cmdlet normally returns immediately without output. The -PassThru switch makes it return information about the process it created, and -Wait makes the cmdlet wait for the process to exit, so that information includes the exit code. Finally, to turn that ASCII code into the drive letter, we cast with [char].
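The decoding step itself is not specific to PowerShell. As a minimal Python sketch, assuming the exit code behaves exactly as described above (0 when the image is not mounted, otherwise the ASCII code of the drive letter), it could look like this. The function name is hypothetical, and rejecting non-letter codes is an added safety check rather than documented batchmnt behavior.

```python
def drive_letter_from_exit_code(exit_code):
    """Turn a batchmnt /check style exit code into a drive letter.

    Returns None for 0 (image not mounted) and raises ValueError for any
    other code that is not an ASCII letter.
    """
    if exit_code == 0:
        return None
    is_upper = ord("A") <= exit_code <= ord("Z")
    is_lower = ord("a") <= exit_code <= ord("z")
    if not (is_upper or is_lower):
        raise ValueError(f"exit code {exit_code} is not an ASCII letter")
    # chr() is the Python counterpart of the [char] cast.
    return chr(exit_code).upper()


print(drive_letter_from_exit_code(69))
print(drive_letter_from_exit_code(0))
```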

by Justin at May 11, 2013 12:47 AM

May 05, 2013

Josef "Jeff" Sipek

Instrument Flying

I was paging through a smart collection in Lightroom, when I came across a batch of photos from early December that I did not share yet. (A smart collection is a filter that will only show you photos satisfying a predicate.)

On December 2nd, one of the people I work with (the same person that told me exactly how easy it is to sign up for lessons) told me that he was going up to do a couple of practice instrument approaches to Jackson (KJXN) in the club’s Cessna 182. He then asked if I wanted to go along. I said yes. It was a warm, overcast day…you know, the kind when the weather seems to sap all the motivation out of you. I was going to sit in the back (the other front seat was occupied by another person I work with, also a pilot) and play with my camera. Below are some of the better shots; there are more in the gallery.

Getting ready to take off:

US-127 and W Berry Rd:

The pilot:

The co-pilot:

On the way back to Ann Arbor (KARB), we climbed to five thousand feet, which took us out of the clouds. Since I was sitting in the back, I was able to swivel around and enjoy the sunset on a completely overcast day. The experience totally made my day. After I get my private pilot certificate, I am definitely going to consider getting instrument rated.

The clouds were very fluffy.

by JeffPC at May 05, 2013 01:41 AM

May 03, 2013

Justin Dearing

Setting the Visual Studio TFS diff and merge tools with PowerShell

I recently wrote this script to let me quickly change the diff and merge tools TFS uses from PowerShell. I plan to make it a module and add it to the StudioShell Contrib package by Jim Christopher (blog|twitter). For now, I share it as a gist and place it on this blog.

The script supports Visual Studio 2008-2012 and the following diff tools:


by Justin at May 03, 2013 02:09 AM

April 28, 2013

Eitan Adler

Pre-Interview NDAs Are Bad

I get quite a few emails from business folk asking me to interview with them or forward their request to other coders I know. Given the volume it isn't feasible to respond affirmatively to all these requests.

If you want to get a coder's attention there are a lot of things you could do, but there is one thing you shouldn't do: require them to sign an NDA before you interview them.

From the candidate's point of view:

  1. There are a lot more ideas than qualified candidates.
  2. It's unlikely your idea is original. It doesn't mean anyone else is working on it, just that someone else probably thought of it.
  3. Let's say the candidate was working on a similar, if not identical, project. If the candidate fails to continue with you, now they have to consult a lawyer to make sure you can't sue them for a project they were working on before.
  4. NDAs are hard legal documents and shouldn't be signed without consulting a lawyer. Does the candidate really want to find a lawyer before interviewing with you?
  5. An NDA puts the entire obligation on the candidate. What does the candidate get from you?

From a company founder's point of view:

  1. Everyone talks about the companies they interview with to someone. Do you want to be that strange company which made them sign an NDA? It can harm your reputation easily.
  2. NDAs do not stop leaks. They serve to create liability when a leak occurs. Do you want to be the company that sues people that interview with them?

There are some exceptions; for example, government and security jobs may require security clearance and an NDA. For most jobs it is possible to determine if a coder is qualified and a good fit without disclosing confidential company secrets.

by Eitan Adler at April 28, 2013 10:37 PM