# Planet LILUG

## June 26, 2015

### Josef "Jeff" Sipek

#### 2015-06-26

Who Has Your Back? — An annual report looking at how different major companies react to government requests for data.

Learn Lua in 15 Minutes

Mega-processor — A project to build a micro-processor using discrete transistors.

## June 22, 2015

### Josef "Jeff" Sipek

#### Simple File System

On three or four occasions over the past 4 years, I had a use for simple file system spec. Either to teach people about file systems, or to have a simple file system to implement to learn the idiosyncracies of an operating system’s VFS layer. This is what I came up with back in 2011 when helping a friend learn about file systems.

### Simple File System

The structure is really simple. All multi-byte integers are stored as big endian.

A disk is a linear sequence of blocks. Each block is 512 bytes long. You can read/write a block at a time. The first block on the disk is number 0, the second is 1, etc.

The following is the file system structure. First of all, the file system uses 1024 byte blocks, and therefore you need to issue two disk I/Os to process a file system block worth of data.

The first fs block (disk blocks 0 & 1) is reserved, you should not change it in any way.

The second block contains the superblock:

struct superblock {
uint32_t magic;        /* == 0x42420374 */
uint32_t root_inode;   /* the disk block containing the root inode */
uint32_t nblocks;      /* number of block on the disk */
uint32_t _pad[253];    /* unused (should be '\0' filled) */
};


Starting at the third block is the block allocation map. The most significant bit of the first byte of this block represents fs block 0. The next bit represents block 1, etc.

Each file is represented by an inode. The inode contains a number of data block pointers. The first pointer (blocks[0]) contains the first 1024 bytes of the file, blocks[1] the second, etc. The timestamps are in microseconds since 00:00:00 Jan 1, 1900 UTC.

sturct inode {
uint32_t size;         /* file length in bytes */
uint32_t _pad0;        /* unused (should be 0) */
uint64_t ctime;        /* creation time stamp */
uint64_t mtime;        /* last modification time stamp */
uint16_t nblocks;      /* number of data blocks in this file */
uint16_t _pad1;        /* unused (should be 0) */
uint32_t _pad2;        /* unused (should be 0) */
uint32_t blocks[248];  /* file block ptrs */
};


The root directory is represented by an inode, the data pointed to by this inode’s blocks[] have a special format. They should be treated as arrays of directory entries. The filename is space padded (so, “foo.txt” would be stored as “foo.txt                     ”).

struct direntry {
char fname[28];        /* the filename */
uint32_t inode;        /* the inode block ptr */
};


So, graphically, the file system looks something like:

sb -> rootinode
|-> direntries
|     |-> <"foo.txt", inodeptr>
|     |                 \-> inode
|     |                       |-> data
|     |                       |-> data
|     |                       \-> data
|     |-> <"bar.txt", inodeptr>
|     |                 \-> inode
|     |                       |-> data
|     |                       |-> data
|     |                       \-> data
|     |             .
|     |             .
|     |             .
|     \-> <"xyz.txt", inodeptr>
|                       \-> inode
|                             |-> data
|                             |-> data
|                             \-> data
|                   .
|                   .
|                   .
\-> direntries
|-> <"aaa.txt", inodeptr>
|                 \-> inode
|                       |-> data
|                       |-> data
|                       \-> data
|-> <"bbb.txt", inodeptr>
|                 \-> inode
|                       |-> data
|                       |-> data
|                       \-> data
|             .
|             .
|             .
\-> <"ccc.txt", inodeptr>
\-> inode
|-> data
|-> data
\-> data


## June 16, 2015

### Josef "Jeff" Sipek

#### Tail Call Optimization

I just had an interesting realization about tail call optimization. Often when people talk about it, they simply describe it as an optimization that the compiler does whenever you end a function with a function call whose return value is propagated up as is. Technically this is true. Practically, people use examples like this:

int foo(int increment)
{
if (increment)
return bar() + 1; /* NOT a tail-call */

return bar(); /* a tail-call */
}


It sounds like a very solid example of a tail-call vs. not. I.e., if you “post process” the value before returning it is not a tail-call.

Going back to my realization, I think people often forget about one type of “post processing” — casts. Consider the following code:

extern short bar();

int foo()
{
return bar();
}


Did you spot it? This is not a tail-call.

The integer promotion from short to int is done after bar returns but before foo returns.

For fun, here’s the disassembly:

$gcc -Wall -O2 -c test.c$ dis test.o
...
foo()
foo:     55                 pushl  %ebp
foo+0x1: 89 e5              movl   %esp,%ebp
foo+0x3: 83 ec 08           subl   $0x8,%esp foo+0x6: e8 fc ff ff ff call -0x4 <foo+0x7> foo+0xb: c9 leave foo+0xc: 98 cwtl foo+0xd: c3 ret  For completeness, if we change the return value of bar to int: $ gcc -Wall -O2 -c test.c
$dis test.o ... foo() foo: e9 fc ff ff ff jmp -0x4 <foo+0x1>  I wonder how many people think they are tail-call optimizing, but in reality they are “wasting” a stack frame because of this silly reason. ### Eitan Adler #### Blogging My Way Through CLRS Section 4.1 After another long break of not writing up any CLRS answers here is section 4.1. Question 4.1-1: What does$\textit{Find-Maximum-Subarray}$return when all elements of$A$are negative? The procedure would return the single element of maximum value. This is expected since the maximum subarray must contain at least one element. This can be computed by note that$\textit{Find-Max-Crossing-Subarray}$will always return the array of solely the midpoint and that$\textit{Find-Maximum-Subarray}$always finds the maxium of$\{leftsum, rightsum, and crosssum\}$Question 4.1-2: Write pseudocode for the brute-force method of solving the max-subarray problem. Your solution should run in$\theta(n^2)$time. max_i = nilmax_j = nilmax_sum = -∞for i in 0..len(A): cur_sum = 0 for j in i..len(A): cur_sum += A[j] if cur_sum > max_sum: max_sum = cur_sum max_i = i max_j = jreturn (max_i, max_j, max_sum)  Question 4.1-3: Implement both the brute-force and recursive algorithms for the maximum-subarray problem on your own computer. What problem size$n_0$gives the crossover point at which the recursive algorithm beats the brute-force algorithm? Then, change the base case of the recursive algorithm to use the brute-force algorithm whenever the problem size is less than$n_0$. Does that change the crossover point? This question asks a question that is specific to the implementation, and the computer on which it is run. I will therefore be skipping it in this writeup. It is worthwhile to note that it is almost guarenteed that changing he implementation to use the brute force method for values less than$n_0$is very likely to change$n_0$. Question 4.1-4: Suppose we change the definition of the maximum-subarray problem to allow the result to be an empty subarray, where the sum of the values of an empty subarray is 0. How would you change any of the algorithms that do not allow empty subarrays to permit an empty subarray to be the result? For the brute force algorithm it would be rather trivial to add a check, and if the return max_sum is > 0 return the empty array. For the recursive divide and conquer algorithm is is sufficient to just change the$\textit{Find-Max-Crossing-Subarray}$in a manner similar to the brute force method. If$\textit{Find-Max-Crossing-Subarray}$return the correct value, then$\textit{Find-Maximum-Subarray}$will do the correct thing. Question 4.1-5: Develop a nonrecursive linear-time algorithm for the maximum-subarray problem.[1] If one knows a previous answer to the max-subarray problem for a given prefix of the array than any new element consists of only two cases: being part of the maximum subarray or not being part of the maximum subarray. It is easier to explain with pseudocode: max_start = 0max_end = 0max_sum = A[0]max_with_j = A[0]for j in 1..len(A): # If J is in a maximum-subarray, either j is going to being the maximum on its, or it will will add to the current max max_with_j = max(A[j], max_with_j + x) Determine if J is in a maximum-subarray if max_with_j >= max_sum: max_sum = max_with_j max_end = j #Set the starting value if j is the sole element of a new subarray if max_with_j == A[j]: max_start = jreturn (max_start, max_end, cur_max)  1. The question provides some hints as to the solution of the problem. ## June 15, 2015 ### Josef "Jeff" Sipek #### 2015-06-15 Going under the hood of Inbox — translating Java to ObjC and Javascript Open-sourcing Facebook Infer: Identify bugs before you ship — a static analysis tool for C, ObjC, and Java TrainFever — a game inspired by Transport Tycoon How Computers Work: A Journey Into the Walk-Through Computer Why “Agile” and especially Scrum are terrible — a thorough “rant” about agile and scrum and their effects on overall productivity ## June 11, 2015 ### Josef "Jeff" Sipek #### 2015-06-11 O(n) binary search for when O(log n) is just too fast. A Constructive Look At TempleOS D3 gallery — chock full of neat visualizations. The Apollo Guidance Computer: Architecture and Operation seems to be a nice book. ## May 29, 2015 ### Josef "Jeff" Sipek #### 2015-05-29 I’m going to try something new. Instead of sharing individual links per post as I come across them, I’m going to try to dump them whenever I have enough of them. It does mean that some of these links aren’t as “hot off the press”. Here’s the first batch. How We’re Predicting AI — or Failing To How Typography Shapes Our Perception Of Truth Bitcoin mining on a 55 year old IBM 1401 mainframe: 80 seconds per hash What is the difference between an “aggregate” and other kinds of “modified versions”? SourceForge grabs GIMP for Windows’ account, wraps installer in bundle-pushing adware Speed tape looks like duct tape but isn’t. ## May 14, 2015 ### Josef "Jeff" Sipek #### Interactivity During nightly(1) Every so often, I do a nightly build of illumos on my laptop. This is a long and very CPU intensive process. During the build (which takes about 2.75 hours), the load average is rarely below 8 and most of the time it hovers in the low twenties. (This is a full debug and non-debug build with lint and all the other checking. If I need a build quickly, I can limit it to just what I need and then we’re talking minutes or seconds.) Anyway, as you might imagine this sort of load puts some pressure on the CPUs. As a result, some interactive processes suffer a bit. Specifically, Firefox doesn’t get enough cycles to render the modern web (read: JavaScript cesspool). Instead of suffering for almost three hours, I just change Firefox’s scheduling class from IA (interactive) to RT (real time): # priocntl -s -c RT pgrep firefox  This allows Firefox to preempt just about everything on my laptop. This works because Firefox actually yields the CPU properly. This will probably bite me sometime in the future when I end up on a page with such a busted JavaScript turd that it takes over a CPU and won’t let go. Till then, I have a pretty good workaround. ## May 08, 2015 ### Nate Berry #### Still using Arch on my Thinkpad T410s, but not with KDE anymore Its been a while since I’ve posted anything here so just as an excercise for the fingers I thought I’d post an update about my current machine. I’ve been running Arch on an old Thinkpad T410s for almost a year now ( journalctl says logs started on June 23, 2014). Its an Intel i5 M560 […] ## March 30, 2015 ### Eitan Adler #### FreeBSD SMB Client under OSX Host I recently purchased a new Macbook Pro and wanted to get a FreeBSD Virtual Machine set up in order to continue doing development work on it. Unfortunately, FreeBSD as a guest does not support native folder sharing so I decided to try using a samba mounted. I decided to set up my VM to have two network interfaces: a NATed interface for internet access and a host-only interface for access to SMB and ssh. The NAT networking configuration looks like: NetworkName: FreeBSDNatNetworkIP: 10.0.2.1Network: 10.0.2.0/24IPv6 Enabled: YesIPv6 Prefix:DHCP Enabled: YesEnabled: YesPort-forwarding (ipv4) SSH IPv4:tcp:[]:5022:[10.0.2.4]:22Port-forwarding (ipv6) FreeBSD ssh:tcp:[]:6022:[fd17:625c:f037:2:a00:27ff:fefc:9dab]:22loopback mappings (ipv4) The Host-Only networking configuration looks like: Name: vboxnet0GUID: 786f6276-656e-4074-8000-0a0027000000DHCP: DisabledIPAddress: 192.168.56.1NetworkMask: 255.255.255.0IPV6Address: IPV6NetworkMaskPrefixLength: 0HardwareAddress: 0a:00:27:00:00:00MediumType: EthernetStatus: UpVBoxNetworkName: HostInterfaceNetworking-vboxnet0 The FreeBSD configuration looks like this: The OSX sharing configuration looks like: Unfortunately, when attempting to actually mount the SMB filesystem with: mount_smbfs -I 192.168.56.1 //eax@192.168.56.1/shared_vbox I get the error mount_smbfs: can't get server address: syserr = Operation timed out I tried installing the package net/samba36 and found that I needed the --signing=off flag to let it work: It seems based on this setup and research that FreeBSD can not natively mount an OSX samba share. It might be possible to use sysutils/fusefs-smbnetfs. Other people have recommended NFS or sshfs. ## March 01, 2015 ### Josef "Jeff" Sipek #### Concorde I just came across someone’s blog post full of cool Concorde photos. It’s a hard choice, but my favorite photo is: (The black and white photography and the unusual camera position in these images remind me of the Wernher von Braun photo I posted years ago.) ## February 20, 2015 ### Josef "Jeff" Sipek #### Voldemort Collections: Iterating over a key-value store Aaaand here’s another link that I’ve been sitting on for way too long. I played with Voldemort a bit back in August/September time frame. While playing with it, I found this page which talks about using key-value storage systems for more complex data structures. #### Practical and Portable x86 Recompilation About a month ago, I stumbled across this fascinating blog post. I finally got around to sharing it here on my blahg. ## February 04, 2015 ### Josef "Jeff" Sipek #### Raspberry Pi Bootloader As I mentioned previously, I decided to do some hardware hacking and as a result I ended up with a Raspberry Pi B+. After playing with Raspbian for a bit, I started trying to read up on how the Pi boots. That’s what today’s post is about. ### Standalone UART While searching for information about how the Pi boots, I stumbled across a git repo with a standalone UART program. Conveniently, the repo includes ELF files as well as raw binary images. (This actually made me go “ewww” at first.) Before even running it, I looked at the code and saw that it prints out 0x12345678 followed by the PC of one of the first instructions of the program. Pretty minimal, but quite enough. ### Boot Process Just about everyone on the internet (and their grandmothers!) knows that the Pi has a multi stage boot. First of all, it is the GPU that starts executing first. The ARM core just sits there waiting for a reset. The Pi requires an SD card to boot — on it must be a FAT formatted partition with a couple of files that you can download from the Raspberry Pi Foundation. Here’s the short version of what happens: 1. Some sort of baked-in firmware loads bootcode.bin from the SD card and executes it. 2. bootcode.bin does a bit of setup, and then loads and executes start.elf. 3. start.elf does a whole lot of setup. 1. Reads config.txt which is a text file with a bunch of options. 2. Splits the RAM between the GPU and CPU (with the help of fixup.dat). 3. Loads the kernel and the ramdisk. 4. Loads the device tree file or sets up ATAGs. 5. Resets the ARM core. The ARM core then begins executing the kernel. This sounds pretty nice, right? For the most part, it is. But as they say, the devil’s in the details. ### Booting ELF Files It doesn’t take a lot of reading to figure out that start.elf can deal with kernel files in ELF format. This made me quite happy since I’ve written an ELF loader before for HVF and while it wasn’t difficult, I didn’t want to go through the exercise just to get something booting. So, I copied uart02.elf to the SD card, and made a minimal config file: kernel=uart02.elf  A power-cycle later, I saw the two hex numbers. (Ok, this is a bit of a distortion of reality. It took far too many reboots because I was playing with multiple things at the same time — including U-Boot which turned out to be a total waste of my time.) The PC was not what I expected it to be. It was 0x8080 instead of 0x800c. After a lot of trial and error, I realized that it just so happened that the .text section is 0x74 bytes into the ELF file. Then I had a horrifying thought… start.elf understands ELF files enough to jump to the right instruction but it does nothing to make the contents of the ELF file properly laid out in memory. It just reads the file into memory starting at address 0x8000, gets the start address from the ELF header, converts it into a file offset and jumps to that location in memory. This is wrong. Sure enough, booting uart02.bin printed the right number. So much for avoiding writing my own ELF loader. ### Ramdisk Once I had a reliable way to get code to execute and communicate via the serial port, I started playing with ramdisks. The code I was booting parsed the ATAGs and looked for ATAG_INITRD2 which describes the location of the ramdisk. So, I extended my config to include the ramfsfile parameter to specify the filename of the ramdisk: kernel=foo ramfsfile=initrd  Reboot, aaaand…the code panicked because it couldn’t find ATAG_INITRD2. It was weird, the file was present, no misspellings, etc. After a while of rebooting and trying different things, I decide to use the initramfs option and just pick some arbitrary high address for the ramdisk to be loaded at. The config changed to: kernel=foo initramfs initrd 0x800000  Another reboot later, I was looking at a dump of the ATAG_INITRD2. Everything worked as expected! So, it turns out that the boot loader on the Pi is not capable of picking an address for the initial ramdisk by itself. ### Command line Finally, I just had to experiment with passing a command line string. I created a file called cmdline.txt on the SD card with some text. A reboot later, I saw that the dump of the ATAG_CMDLINE had some cruft prepended. The entire value of the ATAG looked like (with some spaces replaced by line breaks): dma.dmachans=0x7f35 bcm2708_fb.fbwidth=656 bcm2708_fb.fbheight=416 bcm2708.boardrev=0x10 bcm2708.serial=0xbd225074 smsc95xx.macaddr=B8:27:EB:45:74:FA bcm2708_fb.fbswap=1 bcm2708.disk_led_gpio=47 bcm2708.disk_led_active_low=0 sdhci-bcm2708.emmc_clock_freq=250000000 vc_mem.mem_base=0x1ec00000 vc_mem.mem_size=0x20000000 foo  This isn’t exactly the worst thing, but it does mean that the option parsing code has to handle cruft prepended to what it would expect the command line to look. I wish that these settings were passed via a separate ATAG. ## February 01, 2015 ### Josef "Jeff" Sipek #### Raspberry Pi Serial Just so I don’t forget, my PL2303HX based TTL serial cable is supposed to be connected to the Raspberry Pi B+ in the following way:  Wire Color Wire GPIO header Red 5V DC not connected Black GND Pin 6 White RxD Pin 8 (TxD) Green TxD Pin 10 (RxD) Note that even though the power wire (red) is 5V, the signaling wires (white & green) are 3.3V TTL. On Illumos, the Prolific chip uses the usbsprl driver. ## January 31, 2015 ### Josef "Jeff" Sipek #### Raspberry Pi Two weeks ago, I decided to do some hardware hacking. After a bit of reading up on embedded boards, I ended up buying a Raspberry Pi B+. It’s essentially a slightly smaller form factor version of the B, that has more GPIO pins and uses microSD cards instead of SD cards. I hooked it up to the TV and played with Raspbian and RiscOS a little bit. As you may have guessed by now, that was not enough fun for me. I just had to boot a custom OS that talked over serial. :) This of course required some way to connect the Pi to something that can talk serial. But that’s a post for another day. :P This post is going to be about my impression of the Pi, as well as a cute little use I found for it over the past week. ### Impressions The Pi is a rather small board. The B+ is even smaller. A lot has been written about the technical side, so I won’t bother. I was rather impressed with how much punch this little board packs. The hardest part about getting it going was putting it in the case (I got one of those kits because it was cheaper than buying everything separately). The built-in 4-port USB hub ended up quite useful. It allowed me to plug in both a keyboard and a mouse and have NOOBS installing Raspbian and RiscOS within minutes. A quick reboot later, I was at a shell prompt. That’s where the “new toy high” wore off a little. (I know I’ve talked about this with people before — it’s cool to be portable, but it’s also boring since the architecture becomes irrelevant.) I had a shell, and the most creative thing I could think of was to look at /proc/cpuinfo and /proc/meminfo. I do have some thoughts about where the Pi B+ could have been better. The B version used an SD card. The B+ uses a microSD card. I consider this a bit of a regression. I have a bunch of older SD cards and an SD card reader that works well with SD cards. Sadly, this card reader (using a microSD adapter) fails to play nice with the SDXC modernization of SD that all microSD cards seem to use. I have the same issue with other microSD cards, so I’m pretty sure it’s the card reader. This makes updating a bit more of a pain. The other thing I wish the Pi had is a DB9 RS232 connector. I have USB serial dongles that work well, but to talk serial to the Pi one needs to either get a level converter or a TTL serial to USB cable. I ended up getting a cheap USB cable with a fake Prolific chip inside. It works, but I hear Windows users are having a terrible time with evil drivers from Prolific. ### Storm Timelapse A little over a week after getting the Pi in the mail, we got a large storm heading our way. I got the brilliant idea to set up a webcam in an upstairs window. Previously, this would involve digging up an old computer, setting it up by the window, etc. This time, I reached for the Pi. I connected a webcam to one of the USB ports and a cheap WiFi USB adapter to another. A short config later, Raspbian was on the network even though there’s no network drop in sight. I didn’t want to abuse the microSD card for storage of images, so I mounted an NFS share from the storage server in the basement. I had to use the nolock option to make the mount happen. I probably could have figured out why the lock manager was not running, but it was a temporary setup so a “quick hack” was all I did. To capture images from the webcam, I ended up installing fswebcam, a small program that does one thing and does it well. I started up screen, and ran fswebcam with the following config. device /dev/video0 input 0 loop 5 resolution 800x600 timestamp "%Y-%m-%d %H:%M:%S %Z" jpeg 95 save /mnt/webcam/%Y%m%d/%H/0_%Y%m%d_%H%M%S.jpg palette YUYV  Then, downstairs on my laptop, I mounted the same share and watched the files appear every five seconds. I ended up running the webcam for two days. Here’s a couple of stills from the 27th: And here’s a couple from the 28th: I did make a quick timelapse, but I haven’t tried to figure out a reasonable set of codec options to not end up with 300 MB of video. Maybe one day I’ll find a good set of options and upload the video here. Here’s what I used: ffmpeg -framerate 30 -pattern_type glob -i '20150128/*/0_*.jpg' \ -b:v 5000k -g 300 /tmp/out.mp4  Anyway, that’s it for today. I’ll write again about the Pi in the near future — from an OS developer’s perspective. ## January 28, 2015 ### dorgan #### Brining the back-end to the front-end The latest trend in web seems to be moving the back-end forward. JavaScript frameworks like AngularJS and a new Service called Back& are eliminating the back-end/middleware. Taking the pieces of code that either interactive with databases or web services and moving them client side. This of course with the help of templating frameworks such as Mustache and Handlebars to handle page layouts and changing markup. This movement seems to not only be pushing the evolution of ECMA script but also the tools around building these applications npm, node, grunt, bower. From my experience though, his trend though did not start within the last 2 or even 3 years, but 8 years ago with the release of Sencha's ExtJS Version 2.0. ExtJS is a library/framework for building web applications, ExtJS has evolved quite a long way from version 2 to its current version 5 evolving into MVC framework and in its latest incarnation a MVVC framework. The end result for these client side frameworks are that you work with web services and data (JSON, XML). Moving the front-end/middleware forward though can be risky, you cant just move all your backend/middleware code to the front-end, there are some security concerns that can play into this. For example if you never thought to secure your web service because it was only accessible on a private network by your web stack well you don't want to put the security itself into the front-end as it easily discovered by view the source of your javascript files. This is were OAuth type authentication comes into place. An authentication request is made to the OAuth server and if successful the response contains a token that is used in your web service requests and valid for a given period of time. But with any luck this migration to the front-end will continue evolving and reduce the layers needs for creating complex websites, services and or applications. ## January 26, 2015 ### Nate Berry #### Printing to Canon MX860 wirelessly from Arch Linux I’ve been using Arch more often than the Ubuntu machine lately and since I know that my Canon printer is not well supported under Linux I was putting off trying to get it going under Arch. In Ubuntu I was able to use some .deb packages that were hosted on a European Canon website to […] ## January 19, 2015 ### Nate Berry #### Crontab reference This is just a snippet of text I got from here which I keep around and paste into my crontab on any new machine I set up. If I forget to do this I invariably end up googling it again… every time. # CRONTAB REF # from http://www.adminschoice.com/crontab-quick-reference # # * * * * * […] ## December 08, 2014 ### Josef "Jeff" Sipek #### Debugging with mdb Recently, Theo Schlossnagle posted two interesting articles about debugging on Illumos using mdb. They are MDB, CTF, DWARF, and other angelic things, and mdb custom dmods. ## December 06, 2014 ### Josef "Jeff" Sipek #### Inline Assembly & clang Recently I talked about inline assembly with GCC and clang where I pointed out that LLVM seems to produce rather silly machine code. In a comment, a reader asked if this was LLVM’s IR doing this or if it was the machine code generator being silly. I was going to reply there, but the reply got long enough to deserve its own post. I’ve dealt with LLVM’s IR for a couple of months during the fall of 2010. It was both interesting and quite painful. The IR is at the single static assignment level. It assumes that stack space is cheap and infinite. Since it is a SSA form, it has no notion of registers. The optimization passes transform the IR quite a bit and at the end there is very little (if any!) useless code. In other words, I think it is the machine code generation that is responsible for the unnecessary stack frame push and pop. With that said, it is time to experiment. Using the same test program as before, of course: #define _KERNEL #define _ASM_INLINES #include <sys/atomic.h> void test(uint32_t *x) { atomic_inc_32(x); }  ### Emitting LLVM IR Let’s compile it with clang passing in the -emit-llvm option to have it generate test.ll file with the LLVM IR: $ clang -S -emit-llvm -Wall -O2 -m64 test.c


There is a fair amount of “stuff” in the file, but the relevant portions are (line-wrapped by me):

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
tail call void asm sideeffect "lock; incl $0", "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1 ret void } attributes #0 = { nounwind uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }  LLVM’s IR happens to be very short and to the point. The function prologue and epilogue are not expressed as part of IR blob that gets passed to the machine code generator. Note the function attribute no-frame-pointer-elim being true (meaning frame pointer elimination will not happen). Now, let’s add in the -fomit-frame-pointer option. $ clang -S -emit-llvm -Wall -O2 -m64 -fomit-frame-pointer test.c


Now, the relevant IR pieces are:

; Function Attrs: nounwind
define void @test(i32* %x) #0 {
entry:
tail call void asm sideeffect "lock; incl $0", "=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1 ret void } attributes #0 = { nounwind uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }  The no-frame-pointer-elim attribute changed (from true to false), but the IR of the function itself did not change. (The no-frame-pointer-elim-non-leaf attribute disappeared as well, but it really makes sense since -fomit-frame-pointer is a rather large hammer that just forces frame pointer elimination everywhere and so it doesn’t make sense to differentiate between leaf and non-leaf functions.) So, to answer Steve’s question, the LLVM IR does not include the function prologue and epilogue. This actually makes a lot of sense given that the IR is architecture independent and the exact details of what the prologue has to do are define by the ABIs. ### IR to Assembly We can of course use llc to convert the IR into real 64-bit x86 assembly code. $ llc --march=x86-64 test.ll
$gas -o test.o --64 test.s  Here is the disassembly for clang invocation without -fomit-frame-pointer: test() test: 55 pushq %rbp test+0x1: 48 89 e5 movq %rsp,%rbp test+0x4: f0 ff 07 lock incl (%rdi) test+0x7: 5d popq %rbp test+0x8: c3 ret  And here is the disassembly for clang invocation with -fomit-frame-pointer: test() test: f0 ff 07 lock incl (%rdi) test+0x3: c3 ret  ### Conclusion So, it turns out that my previous post simply stumbled across the fact that GCC and clang have different set of optimizations for -O2. GCC includes -fomit-frame-pointer by default, while clang does not. #### Working with Wide Characters Two weekends ago, I happened to stumble into a situation where I had a use for wide characters. Since I’ve never dealt with them before, it was an interesting experience. I’m hoping to document some of my thoughts and discoveries in this post. As you may have guessed, I am using OpenIndiana for development so excuse me if I happen to stray from straight up POSIX in favor of Illumos-flavored POSIX. The program I was working with happens to read a bunch of strings. It then does some mangling on these strings — specifically, it (1) converts these strings between Unicode and EBCDIC, and (2) at times it needs to uppercase a Unicode character. (Yes, technically the Unicode to EBCDIC conversion is lossy since EBCDIC doesn’t have all possible Unicode characters. Practically, the program only cares about a subset of Unicode characters and those all appear in EBCDIC.) In the past, most of the code I wrote dealt with Unicode by just assuming the world was ASCII. This approach allows UTF-8 to just work in most cases. Assuming you don’t want to mangle the strings in any major way, you’ll be just fine. Concatenation (strcat), ASCII character search (strchr), and substring search (strstr) all work perfectly fine. While other functions will do the wrong thing (e.g., strlen will return number of bytes, not number of characters). Converting an ASCII string to EBCDIC is pretty easy. For each input character (aka. each input byte), do a lookup in a 256-element array. The output is just a concatenation of all the looked up values. This simple approach falls apart if the input is UTF-8. There, some characters (e.g., ö) take up multiple bytes (e.g., c3 b6). Iterating over the input bytes won’t work. One way to deal with this is to process as many bytes as necessary to get a full character (1 for ASCII characters, 2–6 for “non-ASCII” Unicode characters), and then covert/uppercase/whatever it instead of the raw bytes. This sort of hoop-jumping is necessary whenever one wants to process characters instead of bytes. ### wchar_t Another way to deal with this is to store the string as something other than UTF-8. I took this approach. When the program reads in a (UTF-8) string, it promptly converts it into a wide character string. In other words, instead of my strings being char *, they are wchar_t *. On my system, wchar_t is a 32-bit unsigned integer. This trivially makes all Unicode characters the same length — 32 bits. I can go back to assuming that one element of my string corresponds to a single character. I just need to keep in mind that a single character is not one byte. In practice, this means remembering to malloc more memory than before. In other words: wchar_t *a, *b; a = malloc(MAX_LEN); /* WRONG */ b = malloc(sizeof(wchar_t) * MAX_LEN); /* CORRECT */  Uppercasing a character becomes just as easy as it was with plain ol’ ASCII. For example, to uppercase the letter in a string: void uppercase_nth(wchar_t *str, int i) { str[i] = toupper(str[i]); }  There are however some downsides. First and foremost, if you are dealing mostly with ASCII, then your memory footprint may have just quadrupled. (In my case, the program is so small that I don’t care about the memory footprint increase.) Second, you have to deal with a couple of “silly” syntax to make the (C99) compiler realize what it is you are attempting to do. const wchar_t *msg = L"Lorem ipsum"; const wchar_t *letter = L'x';  ### “str” functions Arguably, the most visible change involves the “str” functions. With plain old ASCII strings, you use functions like strlen, strcpy, and strcat to, respectively, get the length, copy a string, and concatenate two strings. These functions assume that each byte is a character and that the string is terminated by a null (8-bit 0) so they do not work in the world of wide characters. (Keep in mind that since ASCII consists of characters with values less than 128, a 32-bit integer with that value will have three null bytes in most characters (assuming ASCII text). On big endian systems, you’ll end up with the empty string, while on little endian systems you’ll end up with a string consisting of just the first character.) Thankfully, there are alternatives to the “str” functions that know how to deal with wide character strings — the “ws” functions. Instead of using strlen, strcpy, and strcat, you want to call wslen, wscpy, and wscat. There are of course more. On Illumos, you can look at the wcstring(3c) manpage for many (but not all!) of them. ### printf & friends Manipulating strings solely with the “str” functions is tedious. Often enough, it is so much simpler to reach for the venerable printf. This is where things get really interesting. The printf family of functions knows how to convert between char * strings and wchar_t * strings. First of all, let’s take a look at snprintf (the same applies to printf and sprintf). Here’s a simple code snippet that dumps a string into a char array. The output is char *, the format string is char *, and the string input is also char *. char output[1024]; char *s = "abc"; snprintf(output, sizeof(output), "foo %s bar\n", s);  One can use %ls to let snprintf know that the corresponding input string is a wide character string. snprintf will do everything the same, except it transparently converts the wide character string into a regular string before outputting it. For example: char output[1024]; wchar_t *s = L"abc"; snprintf(output, sizeof(output), "foo %ls bar\n", s);  Will produce the same output as the previous code snippet. Now, what if you want the output to be a wide character string? Simple, use the wprintf functions! There are fwprintf, wprintf, and swprintf which correspond to fprintf, printf, and snprintf. Do note that the wide-character versions want the format string to be a wide character string. As far as the format string is concerned, the same rules apply as before — %s for char * input and %ls for wchar_t * input: wchar_t output[1024]; wchar_t *s1 = L"abc"; char *s2 = "abc"; swprintf(output, sizeof(output), L"foo %ls %s bar\n", s1, s2);  Caution! In addition to swprintf there is also wsprintf. This one takes the format string in char * but outputs into a wchar_t * buffer. Here’s the same information, in a tabular form. The input string type is always determined by the format string contents — %s for char * input and %ls for wchar_t * input:  Function Output Format string printf, sprintf, snprintf, fprintf char * char * wprintf, swprintf, fwprintf wchar_t * wchar_t * wsprintf wchar_t * char * ### setlocale and Summary Oh, I almost forgot! You should call setlocale before you start using all these features. So, to conclude, it is pretty easy to use wide character strings. • #include <wchar.h> • #include <widec.h> • call setlocale in your main • use wchar_t instead of char • use %ls in format strings instead of %s • use L string literal prefix • beware of wsprintf and swprintf I wouldn’t want to deal with this sort of code on daily basis, but for a random side project it isn’t so bad. I do like the ability to not worry about the encoding — the 1:1 mapping of characters to array elements is really convenient. ## December 01, 2014 ### Josef "Jeff" Sipek #### Delegating mount/umount Privileges Recently, I was doing some file system changes. Obviously, I wanted to run them as an unprivileged user. Unfortunately, the test involved mounting and unmounting a filesystem (tmpfs to be specific). At first I was going to set up a sudo rule to allow mount and umount to run without asking for a password. Then I remembered that I should be able to give the unprivileged user the additional privileges. It turns out that there is only one privilege (sys_mount) necessary to delegate…and it is easy to do! $ usermod -K defaultpriv=basic,sys_mount jeffpc


Then it’s a matter of logging out and back in. We can check using ppriv:

$ppriv$$925: bash flags = <none> E: basic,sys_mount I: basic,sys_mount P: basic,sys_mount L: all  At this point, mounting and unmounting works without sudo or similar user switching: $ mkdir tmp
$mount -F tmpfs none /tmp/tmp$ df -h /tmp/tmp
Filesystem      Size  Used Avail Use% Mounted on
swap            2.6G     0  2.6G   0% /tmp/tmp


## November 27, 2014

### Josef "Jeff" Sipek

#### Inline Assembly & GCC, clang

Recently, I got to write a bit of inline assembly. In the process I got to test my changes by making a small C file which defined test function that called the inline function from the header. Then, I could look at the disassembly to verify all was well.

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
atomic_inc_32(x);
}


GCC has been my go to complier for a long time now. So, at first I was using it to debug my inline assembly. I compiled the test programs using:

$gcc -Wall -O2 -m64 -c test.c  Disassembling the object file yields the rather obvious: test() test: f0 ff 07 lock incl (%rdi) test+0x3: c3 ret  I can’t think of any way to make it better :) Then, at some point I remembered that Clang/LLVM are pretty good as well. I compiled the same file with clang: $ clang -Wall -O2 -m64 -c test.c


The result was rather disappointing:

test()
test:     55                 pushq  %rbp
test+0x1: 48 89 e5           movq   %rsp,%rbp
test+0x4: f0 ff 07           lock incl (%rdi)
test+0x7: 5d                 popq   %rbp
test+0x8: c3                 ret


For whatever reason, Clang feels the need to push/pop the frame pointer. I did a little bit of searching, and I couldn’t find a way to disable this behavior.

The story for 32-bit output is very similar (just drop the -m64 from the compiler invocation). GCC produced the superior output:

test()
test:     8b 44 24 04        movl   0x4(%esp),%eax
test+0x4: f0 ff 00           lock incl (%eax)
test+0x7: c3                 ret


While Clang still wanted to muck around with the frame pointer.

test()
test:     55                 pushl  %ebp
test+0x1: 89 e5              movl   %esp,%ebp
test+0x3: 8b 45 08           movl   0x8(%ebp),%eax
test+0x6: f0 ff 00           lock incl (%eax)
test+0x9: 5d                 popl   %ebp
test+0xa: c3                 ret


For the curious ones, I’m using GCC 4.8.3 and Clang 3.4.2.

I realize this is a bit of a special case (how often to you make a function that simply calls an inline function?), but it makes me worried about what sort of sub-optimal code Clang produces in other cases.

## November 15, 2014

### Nate Berry

#### Using i3 on my Arch USB flash drive

Back in February I wrote about setting up a bootable USB stick with Arch Linux. At the time I was using it with a Dell laptop, but since then have been running it mainly off an old Thinkpad T410s (with a now totally non-functional power cable and a cracked palmrest) that had been retired from […]

## October 03, 2014

### dorgan

#### MyFancyPrincess.com Ruins Halloween!!

Steer clear of MyFancyPrincess.com.   On July 13, 2014 my mother placed an order for the Anna/Elsa dresses that were on pre-order as she wanted to get the Anna/Elsa dresses for my twin daughters (age 2) so we directed her there as the reviews of the company were great and it pricing was great.  We all understood that this was a pre-order item and would not be receiving it right away.  Some time went by and I wanted to follow up with www.myfancyprincess.com so I asked my mom to forward me the confirmation email so that I could get the order number and follow up with them. So she did and I sent the following email on 8/25/14 to follow up on the order:

I am trying to follow up on the order status for:  myfancyprincess-xxxxx

I know the order was for pre-order, so I just wanted to follow up on the status for the actual items.

On the same day I received the following response:
According to the date you ordered, you are in our third presale shipment which is not due to arrive here at our location until the end of August/early September.  Once we receive the shipment and check it in, then we will ship in the order received.  If there are no further delays you should see a shipping confirmation somewhere around mid to third week of September.

Excellent, a quick reply and an approximate date of when to expect the dresses.  So some more time goes by and we hadn't received the dresses yet, so I called my mom and asked her if she had heard anything, and she had not.  So on September 23, 2014 I sent a quick email:

I just wanted to follow up on this order as we still have not received anything and it is now the end of September.

Another 5 days went by without any type of response, so on 9/26/14 I sent another email as their phone system says the way to get the  quick response is to email them:
I have not received my order not a response from you this week, not quite sure what is going on.
On 10/01/14 still no response, so at this point we call and leave a voice message, as well as send another email:
I am sending yet another email to follow up on this order.  Please its getting close to halloween and it was our understanding that we would have these items by now.....

So no response  for most of the day on 10/02/14 so I send the following email, granted its definitely confrontational, but all I am looking for is a status update:
So yet another week has gone by.  Both my wife and I have called and left messages as well as sent email and NOTHING has been responded to.  This is totally UNACCEPTABLE and I will start a social media campaign soon if I don't hear back.

Also being slightly disgruntled at this point, I figured I would try another contact medium, Facebook.  So I posted a message along the same lines as my previous emails. (Which has now been deleted).  That seem to get their attention and I received the following email reply.
Any details regarding the order are released to the purchaser only.  We see we previously responded to an e-mail but that was a mistake.  Thank you.
To which I responded:

Ok please email the purchaser (my mom) with an update and I will contact her to get the details.
And they comment on my Facebook feed also that they have responded to my email as well as forwarded the information along to my Mom, great an update, we are happy.  My mom then tells me that the email states the dresses will not be shipping for another 1-2 weeks and then we'll receive them 5 days after that.  They also explained that this is not their fault and that it is the fault of their manufacturer/distributor.  So hey what are you going to do, so we just have to wait.  Well it seems that they didn't like some of the negative comments that some of my friends/family put in the thread that I had started with them.  So they deleted the post.  And sent the following email to my mom, the original purchaser:

We have gone ahead and canceled this order.  Order delays from our supplier are not our fault and we will not continue to be bashed publicly for something that is not out fault.  we just spoke to our supplier yesterday and they are the ones delaying, NOT us.  We have explained this just this morning to your husband (I think they meant son) who also tried to publicly shame us for this.  We explained it respectfully and nicely.  Yet you felt the need to once again publicly bash us for what we already explained was not our fault.  We are just as upset over this as you are and have on more then one occasion expressed out disappointment that we are the ones taking al the blame for the delays that are not our fault.  We also gave you an option to switch to other in stock dresses (Double the cost) and instead of e-mailing us to work something out you once again went on our page to publicly express your disappointment  (My mom, the actual purchaser, never posted on the page, my wife did when she saw they deleted my post to their page).  You have every right to be disappointed, but please understand that we did not cause this.  You have been refunded in full and the order is now cancelled.

Well that got me really pissed as they could have use the opportunity to shine in a customer service issue, and they chose not to.  So I looked back at some of their Facebook posts to see if anyone else was complaining on their Facebook page and found a recent one within the last day or two and commented on that posting stating to be careful what they post as if they find it "offensive" or that it is "bashing" them they would cancel your order.  Since that comment their Facebook page is now completely locked down, no commenting, no liking and no posting.

Ultimately they should have been sending status updates on these pre-order items, thats the right thing to do.  They also could have used the Facebook posts to shine in customer service but chose to hide everything in email.

I am sorry but if you are on social media you must take the good and the bad with it.  You can't just delete/hide everything that you don't like you have to use it as a tool to show everyone else how you can treat the customer with respect.

In the end who suffers, my 2 year old twin daughters, as we are really close to Halloween and no one else has these costumes in their size.  My wife just told me that she let my daughter Sarah know that she might have to be something else for Halloween and she started to cry.

Shame on you www.myfancyprincess.com!!

## September 03, 2014

#### Finding the majority element in a stream of numbers

Some time ago I came across the following question.
As input a finite stream stream of numbers is provided. Define an algorithm to find the majority element of the input. The algorithm need not provide a sensible result if no majority element exists. You may assume a transdichotomous memory model.
There are a few definitions which may not be immediately clear:
Stream
A possibly infinite set of data which may not be reused in either the forward or backward direction without explicitly storing it.
Majority element
An element in a set which occurs more than half the time.
Transdichotomous
The integer size is equal to the word size of memory. One does not need to worry about storing partial pieces of integers in separate memory units.
Unfortunately this answer isn't of my own invention, but it is interesting and succinct.

The algorithm (click to view)Using 3 registers the accumulator, the guess and the current element (next):
1. Initialize accumulator to 0
2. Accept the next element of the stream and place it into next. If there are no more elements go to step #7.
3. If accumulator is 0 place next into guess and increment accumulator.
4. Else if guess matches next increment accumulator
5. Else decrement accumulator
6. Go to step 2
7. Return the value in guess as the result

An interesting property of this algorithm is that it can be implemented in $O(n)$ time even on a single tape Turing Machine.

## August 13, 2014

### Josef "Jeff" Sipek

#### Serial Console in a Zone

In the past, I’ve talked about serial consoles. I have described how to set up a serial console on Solaris/OpenIndiana. I’ve talked about Grub’s composite console in Illumos-based distros. This time, I’m going do describe the one trick necessary to get tip(1) in a zone working.

In my case, I am using SmartOS to run my zones. Sadly, SmartOS doesn’t support device pass-through of this sort, so I have to tweak the zone config after I create the zone with vmadm.

Let’s assume that the serial port I want to pass through is /dev/term/a. Passing it through into a zone is as easy as:

[root@isis ~]# zonecfg -z 7cff99f6-2b01-464d-9f72-d0ef16ce48af
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> set match=/dev/term/a
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> end
zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> commit


At this point, you’ll probably want to reboot the zone (I don’t remember if it is strictly necessary). Once it is back up, you’ll want to get into the zone and point your software of choice at /dev/term/a. It doesn’t matter that you are in a zone. The same configuration rules apply — in my case, it’s the same change to /etc/remote as I described previously.

#### Inlining Atomic Operations

One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages — atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is, inlining them caused not only about 1% kernel performance improvement on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines.

### Genesis

It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:

	ENTRY(atomic_inc_64)
ALTENTRY(atomic_inc_ulong)
lock
incq	(%rdi)
ret
SET_SIZE(atomic_inc_ulong)
SET_SIZE(atomic_inc_64)


The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros to make writing assembly functions semi-sane. Anyway, this code is used by both the kernel as well as userspace. I am going to ignore the userspace side of the picture and talk about the kernel only.

These assembly functions, get mangled by the C preprocessor, and then are fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions the krtld (linker-loader) wires up those references to this code.

### Inline

Replacing these function with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However doing so would remove the actual functions (as well as the symbol table entries) and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go.

### Inline v2

Before I go into the next and final approach, I’m going to make a small detour through C land.

#### extern inline

First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c:

#include "add.h"

{
return x + y;
}


In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect.

extern int add(int, int);


Then, we attempt to call it from a function in, say, test.c:

#include "add.h"

int test()
{
}


Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add:

test()
test:     be 07 00 00 00     movl   $0x7,%esi test+0x5: bf 05 00 00 00 movl$0x5,%edi
test+0xa: e9 b1 fe ff ff     jmp    -0x14f	<0xc90>


And the binary contains both functions:

$/usr/bin/nm test.so | egrep '(Value|test$|add$)' [Index] Value Size Type Bind Other Shndx Name [74] | 3520| 4|FUNC |GLOB |0 |13 |add [65] | 3536| 15|FUNC |GLOB |0 |13 |test  Now suppose that we modify the header file to include the following (assuming GCC’s inline definition): extern int add(int, int); extern inline int add(int a, int b) { return a + b; }  If we compile and link the same .so the same way, that is we feed in the object file with the previously used implementation of add, we’ll get a slightly different binary. The invocation of add will use the inlined version: test() test: b8 0c 00 00 00 movl$0xc,%eax
test+0x5: c3                 ret


But the binary will still include the symbol:

$/usr/bin/nm test.so | egrep '(Value|test$|add)' [Index] Value Size Type Bind Other Shndx Name [72] | 3408| 4|FUNC |GLOB |0 |11 |add [63] | 3424| 6|FUNC |GLOB |0 |11 |test  Neat, eh? #### extern inline atomic what? How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table. The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h. Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined. So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined. Phew! This ended up being longer than I expected. :) ## August 12, 2014 ### Josef "Jeff" Sipek #### Grub Composite Console In the past, I’ve described how to get a serial console going on Illumos based systems. If you ever used a serial console in Grub (regardless of the OS you ended up booting), you probably know that telling Grub to output to a serial port causes the VGA console to become totally useless — it’s blank. Well, if you are using Illumos, you are in luck. About 5 months ago, Joyent integrated a “composite console” in Grub. You can read the full description in the bug report/feature request. The short version is: all grub output can be sent to both the VGA console as well as over a serial port. It is very easy to configure. In your menu.lst, change the terminal to composite. For example, this comes from my test box’s config file (omitting the uninteresting bits): serial --unit=0 --speed=115200 terminal composite  Note the use of composite instead of serial. That’s all there is to it. ## August 06, 2014 ### Josef "Jeff" Sipek #### Operating Systems: Three Easy Pieces I just found out that Remzi and Andrea decided to write a textbook about operating systems. This is exciting for several reasons. Here are the top two. First and foremost, the book is free. That’s right, a textbook that is free when every other computer science textbook is easily around100. Why? I’ll let Remzi make the case. Long story short, publishing a textbook isn’t about making money. It is about sharing ideas. You can download it from the textbook’s website.

Second, the book is by Remzi and Andrea. This pair of professors from the University of Wisconsin is responsible for a ton of amazing storage related research. If you don’t believe me, check out their publication track record.

I suppose I should mention that I have read only very little of the book, but I did push it onto the top of my to-read stack and I’m slowly making my way through it. I’ll let you all know how it goes.

## August 04, 2014

### Josef "Jeff" Sipek

#### Lua Compatibility

Phew! Yesterday afternoon, I decided to upgrade my laptop’s OpenIndiana from 151a9 to “Hipster”. I did it in a bit convoluted way, and hopefully I’ll write about that some other day. In the end, I ended up with a fresh install of the OS with X11 and Gnome. If you’ve ever seen my monitors, you know that I do not use Gnome — I use Notion. So, of course I had it install it. Sadly, OpenIndiana doesn’t ship it so it was up to me to compile it. After the usual fight to get a piece of software to compile on Illumos (a number of the Solaris-isms are still visible), I got it installed. A quick gdm login later, Notion threw me into a minimal environment because something was exploding.

After far too many hours of fighting it, searching online, and trying random things, I concluded that it was not Notion’s fault. Rather, it was something on the system. Eventually, I figured it out. Lua 5.2 (which is standard on Hipster) is not compatible with Lua 5.1 (which is standard on 151a9)! Specifically, a number of functions have been removed and the behavior of other functions changed. Not being a Lua expert (I just deal with it whevever I need to change my window manager’s configuration), it took longer than it should but eventually I managed to get Notion working like it should be.

So, what sort of incompatibilies did I have to work around?

loadstring got renamed to load. This is an easy to fix thing, but still a headache especially if you want to support multiple versions of Lua.

### table.maxn

table.maxn got removed. This function returned the largest positive integer key in an associative array (aka. a table) or 0 if there aren’t any. (Lua indexes arrays starting at 1.) The developers decided that it’s so simple that those that want it can write it themselves. Here’s my version:

local function table_maxn(t)
local mn = 0
for k, v in pairs(t) do
if mn < k then
mn = k
end
end
return mn
end


### table.insert

table.insert now checks bounds. There doesn’t appear to be any specific way to get old behavior. In my case, I was lucky. The index/positition for the insertion was one higher than table_maxn returned. So, I could replace:

table.insert(ret, pos, newscreen)


with:

ret[pos] = newscreen


### Final Thougths

I can understand wanting to deprecate old crufty interfaces, but I’m not sure that the Lua developers did it right. I really think they should have marked those interfaces as obsolete, make any use spit out a warning, and then in a couple of years remove it. I think that not doing this, will hurt Lua 5.2’s adoption.

Yes, I believe there is some sort of a compile time option for Lua to get legacy interfaces, but not everyone wants to recompile Lua because the system installed version wasn’t compiled quite the way that would make things Just Work™.

## August 02, 2014

### Josef "Jeff" Sipek

#### Generating Random Data

Over the years, there have been occasions when I needed to generate random data to feed into whatever system. At times, simply using /dev/random or /dev/urandom was sufficient. At other times, I needed to generate random data at a rate that exceeded what I could get out of /dev/random. This morning, I read Chris’s blog entry about his need for generating lots of random data. I decided that I should write my favorite approach so that others can benefit.

The approach is very simple. There are two phases. First, we set up our own random pool. Second we use the random pool. I am going to use an example throughout the rest of this post. Suppose that we want to make repeated 128 kB writes to a block device and we want the data to be random so that the device can’t do anything clever (e.g., compress or dedup). Say that during this benchmark we want to write out 64 GB total. (In other words, we will issue 524288 writes.)

### Setup Phase

During the setup phase, we create a pool of random data. The easiest way is to just read /dev/urandom. Here, we want to read enough data so that the pool is large enough. For our 128kB write example, we’d want at least 1 MB. (I’ll explain the sizing later. I would probably go with something like 8 MB because unless I’m on some sort of limited system, the extra 7 MB of RAM won’t be missed.)

### “Using the Pool” Phase

Now that we have the pool, we can use it to generate random buffers. The basic idea is to pick a random offset into the pool and just grab the bytes starting at that location. In our example, we’d pick a random offset between zero and pool size minus 128 kB, and use the 128 kB at that offset.

In pseudo code:

#define BUF_SIZE	(128 * 1024)
#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

char *generate()
{
return &pool[rand() % (POOL_SIZE - BUF_SIZE)];
}


That’s it! You can of course make it more general and let the caller tell you how many bytes they want:

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];

char *generate(size_t len)
{
return &pool[rand() % (POOL_SIZE - len)];
}


It takes a pretty simple argument to show that even a modest sized pool will be able to generate lots of different random buffers. Let’s say we’re dealing with the 128 kB buffer and 1 MB pool case. The pool can return 128 kB starting at offset 0, or offset 1, or offset 2, … or offset 9175043 (). This means that there are 917504 possible outputs. Recall, that in our example we were planning on writing out 64 GB in total which was 524288 writes.

In other words, we are planning on using less than 58% of the possible outputs from our 1 MB pool to write out 64 GB of random data! (An 8 MB pool would yield 6.3% usage.)

If the length is variable, the math gets more complicated, but in a way we get even better results (i.e., lower usage) because to generate the same buffer we would need have the same offset and length. If the caller supplies random (pseudo-random or based on some distribution) lengths, we’re very unlikely to get the same buffer out of the pool.

### Observations

Some of you may have noticed that we traded generating 128 kB (or a user supplied length) of random data for generating a random integer. There are two options there, either you can use a fast pseudo-random number generator (like the  Mersenne twister), or you can reuse same pool! In other words:

#define POOL_SIZE	(1024 * 1024)

static char pool[POOL_SIZE];
static size_t *ridx = (size_t *) pool;

char *generate(size_t len)
{
if ((uintptr_t) ridx == (uintptr_t)&pool[POOL_SIZE])
ridx = (size_t *) pool;

ridx++;

return &pool[ridx % (POOL_SIZE - len)];
}


I leave it as an exercise for the reader to either make it multi-thread safe, or to make the index passed in similarly to how rand_r takes an argument.

### rand_r Considered Harmful

Since we’re on the topic of random number generation, I thought I’d mention what is already rather widely known fact — libc’s rand and rand_r implementations are terrible. At one point, I tried using them with this approach to generate random buffers, but it didn’t take very long before I got repeats! Caveat emptor.

## July 23, 2014

### Josef "Jeff" Sipek

#### Segment Drivers

Lately, I started poking around the Illumos memory management code. As I’ve done in the past, I decided to use this blahg as a place to document some of my discoveries.

### Memory Layout

In Illumos (and Solaris), address spaces are managed as sets of segments. Each segment has a base address, length, and a number of other properties. This is true for both process memory as well as kernel memory. Do not confuse these segments with  memory segmentation that processors like  x86 provide.

Each process has its own struct as:

> ::pgrep vim
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  10852  10777  10850  10777    101 0x4a004000 ffffff0411e1c0a0 vim
> ffffff0411e1c0a0::print proc_t p_as | ::print struct as a_segtree
a_segtree = {
a_segtree.avl_root = 0xffffff03f7c62ea8
a_segtree.avl_compar = as_segcompar
a_segtree.avl_offset = 0x20
a_segtree.avl_numnodes = 0x18
a_segtree.avl_size = 0x60
}


The kernel address space is maintained in the kas global:

> kas::print a_segtree
a_segtree = {
a_segtree.avl_root = kvseg+0x20
a_segtree.avl_compar = as_segcompar
a_segtree.avl_offset = 0x20
a_segtree.avl_numnodes = 0x9
a_segtree.avl_size = 0x60
}


(Once upon a time this set of segments was a linked list, but for a long while now it has been an AVL tree indexed by the base address.)

Regardless of which address space we’re dealing with, the same rules apply: segments represent contiguous regions within the address space. Each segment can represent a different type of memory. For example, walking the kernel address space segment tree yields nine different segments of four different types (kpm, kmem, kp, and map):

> kas::print a_segtree | ::walk avl | ::printf "%p.%016x %a\n" "struct seg" s_base s_size s_ops
fffffe0000000000.000000031e000000 segkpm_ops
ffffff0000000000.0000000017000000 segkmem_ops
ffffff0017000000.0000000080000000 segkp_ops
ffffff0097000000.00000002fca00000 segkmem_ops
ffffff03d3a00000.0000000004000000 segmap_ops
ffffff03d7a00000.000000fbe8600000 segkmem_ops
ffffffffc0000000.000000003b7fb000 segkmem_ops
fffffffffb800000.0000000000550000 segkmem_ops
ffffffffff800000.0000000000400000 segkmem_ops


### Segment Drivers

Illumos comes with seven different architecture- and platform-independent segment drivers. A segment driver is a “driver” that implements a couple of functions to manage a segment of memory. That is, each segment type can handle page faults, page locking, sync operations, etc. differently.

For example, suppose that a page fault occurs because a process tried to load a value from a page that lacks a page table entry. The platform specific (assembly) fault handling code gets invoked by the processor. After doing a little bit of work, it calls into the generic (C) fault handling code, as_fault. There, the segtree AVL tree is consulted and the corresponding segment’s fault operation gets invoked.

(Solaris Internals lists 12 and 11 segment drivers, respectively, in the two editions.) In Illumos, the seven common segment drivers are:

seg_dev
Most of the time, userspace processes do not need to map devices into their address space. In the rare case when a process does want a device mapped (e.g., Xorg), the dev segment driver maintains that mapping.
seg_kmem
This segment driver maps the kernel heap, module text, and all early boot memory. (code)
seg_kp
In general, kernel memory is not pageable. In the rare case that something can be in kernel pageable memory, this segment is what maintains the anonymous page mappings.
seg_kpm
If possible (you’re on a 64-bit system), the kpm segment driver maps all physical memory into the kernel’s address space. This allows the kernel to not have to set up temporary mappings to operate on physical memory. (code)
seg_map
The map segment driver is a kernel-only higher performance version of the vn segment driver. (See below.)
seg_spt
This segment driver is responsible for maintaining SysV shared memory segments. (Not to be confused with POSIX shared memory.)
seg_vn
Memory mapped files are handled by the vn segment driver. This includes both regular files as well as anonymous memory.

There are also two platform specific segment drivers:

seg_mf (i86xpv only)
This segment driver is only used by dom0 processes (read: Xen) to map pages from other domains.
seg_nf (sparc v9 only)
The header for the file says that it is for non-faulting loads. I don’t actually know what exactly it is for. (And I don’t care enough to dig deeper given that it is Sparc specific.)

### The Reality

This is a lot of different segment drivers. Are all of them used all the time? Well, sort of. The mdb output earlier shows that the (amd64) kernel uses only four different segment drivers (kpm, kmem, kp, and map). A typical userspace process is very boring — it is only made up of vn segments. There are, however, exceptions. For instance, Xorg uses vn and dev. This accounts for six of the seven drivers. The last common segment driver is spt, which provides System V shared memory. (I talked about SysV shared memory previously.) So, on a 64-bit x86 system, all seven common segment drivers are in use.

The story is a bit different on 32-bit kernels. Since a 32-bit system has much smaller address space, the kernel tries to eliminate a number of mappings. Here is the list of segments in a 32-bit kernel:

> kas::print a_segtree | ::walk avl | ::printf "%p %a\n" "struct seg" s_base s_ops
b5802000 segmap_ops
b6800000 segkmem_ops
ef400000 segkmem_ops
fe800000 segkmem_ops
ff000000 segkmem_ops


As you can see, the kp and kpm segments went away. While at first this is surprising, it actually makes perfect sense. When thinking about memory there are two “types” to consider: physical and virtual. In theory, one can have more virtual than physical thanks to the MMU but in reality this is only true on 64-bit systems. The physical memory sizes have outgrown 4 GB a number of years ago and therefore a 32-bit address space can trivially be 100% backed by physical memory. In other words, 32-bit address spaces are tight on virtual memory, while 64-bit address spaces are “tight” on physical memory.

Let’s consider the disappearance of the kp segment on 32-bits. What does kp let us do? It lets us oversubscribe physical memory by backing some virtual memory with disk space. On 32-bit systems we have enough physical memory to back all the virtual memory in the kernel so we don’t need to back some of it by disk. So we have no use for it. (Yes, the kernel still could have paged parts of itself out, but kernel text and data is generally considered important enough to keep it in non-pageable memory. The memory utilization will more than pay for itself by the performance improvement of not having the kernel paged out.)

As I stated before, kpm segments map physical memory into the kernel’s address space for performance reasons (without it the kernel would have to temporarily map a page to access the contents). Therefore, they are good candidates for removal when it comes to slimming down the kernel’s address space demands. (Well, the actual story is the other way… the introduction of 64-bit capable hardware allowed kpm segments to exist to improve kernel performance.)

## July 14, 2014

### Josef "Jeff" Sipek

#### Unix Shared Memory

While investigating whether some memory management code was still in use (I’ll blahg about this in the future), I ended up learning quite a bit about shared memory on Unix systems. Since I managed to run into a couple of non-obvious snags while trying to get a simple test program running, I thought I’d share my findings here for my future self.

All in all, there are three ways to share memory between processes on a modern Unix system.

### System V shm

This is the oldest of the three. First you call shmget to set up a shared memory segment and then you call shmat to map it into your address space. Here’s a quick example that does not do any error checking or cleanup:

void sysv_shm()
{
int ret;
void *ptr;

ret = shmget(0x1234, 4096, IPC_CREAT);
printf("shmget returned %d (%d: %s)\n", ret, errno,
strerror(errno));

ptr = shmat(ret, NULL, SHM_PAGEABLE | SHM_RND);
printf("shmat returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}


What’s so tricky about this? Well, by default Illumos’s shmat will return EPERM unless you are root. This sort of makes sense given how this flavor of shared memory is implemented. (Hint: it’s all in the kernel)

### POSIX shm

As is frequently the case, POSIX came up with a different interface and different semantics for shared memory. Here’s the POSIX shm version of the above function:

void posix_shm()
{
int fd;
void *ptr;

fd = shm_open("/blah", O_RDWR | O_CREAT, 0666);
printf("shm_open returned %d (%d: %s)\n", fd, errno,
strerror(errno));

ftruncate(fd, 4096); /* IMPORTANT! */

ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}


The very important part here is the ftruncate call. Without it, shm_open may create an empty file and mmaping an empty file won’t work very well. (Well, on Illumos mmap succeeds, but you effectively have a 0-length mapping so any loads or stores will result in a SIGBUS. I haven’t tried other OSes.)

Aside from the funny looking path (it must start with a slash, but cannot contain any other slashes), shm_open looks remarkably like the open system call. It turns out that at least on Illumos, shm_open is implemented entirely in libc. The implementation creates a file in /tmp based on the path provided and the file descriptor that it returns is actually a file descriptor for this file in /tmp. For example, “/blah” input translates into “/tmp/.SHMDblah”. (There is a second file “/tmp/.SHMLblah” that doesn’t live very long. I think it is a lock file.) The subsequent mmap call doesn’t have any idea that this file is special in any way.

Does this mean that you can reach around shm_open and manipulate the object directly? Not exactly. POSIX states: “It is unspecified whether the name appears in the file system and is visible to other functions that take pathnames as arguments.”

The big difference between POSIX and SysV shared memory is how you refer to the segment — SysV uses a numeric key, while POSIX uses a path.

### mmap

The last way of sharing memory involves no specialized APIs. It’s just plain ol’ mmap on an open file. For completeness, here’s the function:

void mmap_shm()
{
int fd;
void *ptr;

fd = open("/tmp/blah", O_RDWR | O_CREAT, 0666);
printf("open returned %d (%d: %s)\n", fd, errno, strerror(errno));

ftruncate(fd, 4096); /* IMPORTANT! */

ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno));
}


It is very similar to the POSIX shm code example. As before, we need the ftruncate to make the shared file non-empty.

### pmap

In case you’ve wondered what SysV or POSIX shm segments look like on Illumos, here’s the pmap output for a process that basically runs the first two examples above.

6343:	./a.out
0000000000400000          8K r-x--  /storage/home/jeffpc/src/shm/a.out
0000000000411000          4K rw---  /storage/home/jeffpc/src/shm/a.out
0000000000412000         16K rw---    [ heap ]
FFFFFD7FFF160000          4K rwxs-    [ dism shmid=0x13 ]
FFFFFD7FFF170000          4K rw-s-  /tmp/.SHMDblah
FFFFFD7FFF180000         24K rwx--    [ anon ]
FFFFFD7FFF190000          4K rwx--    [ anon ]
FFFFFD7FFF1A0000       1596K r-x--  /lib/amd64/libc.so.1
FFFFFD7FFF33F000         52K rw---  /lib/amd64/libc.so.1
FFFFFD7FFF34C000          8K rw---  /lib/amd64/libc.so.1
FFFFFD7FFF350000          4K rwx--    [ anon ]
FFFFFD7FFF360000          4K rwx--    [ anon ]
FFFFFD7FFF370000          4K rw---    [ anon ]
FFFFFD7FFF380000          4K rw---    [ anon ]
FFFFFD7FFF390000          4K rwx--    [ anon ]
FFFFFD7FFF393000        348K r-x--  /lib/amd64/ld.so.1
FFFFFD7FFF3FA000         12K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFF3FD000          8K rwx--  /lib/amd64/ld.so.1
FFFFFD7FFFDFD000         12K rw---    [ stack ]
total         2120K


You can see that the POSIX shm file got mapped in the standard way (address FFFFFD7FFF170000). The SysV shm segment is special — it is not a plain old memory map (address FFFFFD7FFF160000).

That’s it for today. I’m going to talk about segment types in the different post in the near future.

## June 06, 2014

### Josef "Jeff" Sipek

#### Moving and Downtime

I’ll be moving my server over the next couple of days. I’m working on an email setup to make sure there’s no interruption there. The website and the blahg will however be down until Wednesday evening. Sorry for any inconvenience this may cause.

## May 24, 2014

### Nate Berry

#### Upgrading the Macbook to Ubuntu 14.04

While I certainly have been enjoying running Arch on usb on the Dell, my main machine is still the old MacBook running Ubuntu. I had some extra time today and thought “hey, I should install steam and grab FTL and Kerbal Space Program” but then promptly decided Id rather do upgrades? Im running Gnome3 not […]

## May 12, 2014

### Justin Lintz

#### Ode to Flickr

Over the last 9 years, I’ve had a few things in my life that have benefited me both professionally and personally that have all tied back to Flickr.

In 2003, Canon had released the first sub-$1000 digital SLR camera, the Canon EOS 300D aka the EOS Digital Rebel. The 300D paved the way for entry level consumers to get involved with the world of digital SLRs, previously the cheapest option was the Canon 10D, available at$2,000. You could now learn photography with full control over your camera and have instant feedback without burning through rolls of film.

I joined Flickr in June of 2005. I was checking the site everyday for months prior, viewing the explore page and wondering how people were taking such incredible photos. How were people getting that “blur” in the background of their photos, and why couldn’t I get that effect with my Nikon Coolpix 2500 camera I had at the time? Researching what made the backgrounds “blurry” (which I later learned is called bokeh, and is related to aperture size), I was introduced into the world of DSLRs.

Each day I would check Flickr and I was realizing my current camera was not going to cut it for taking the types of photos I wanted. Not being able to control the ISO, shutter speed or aperture on my camera was holding me back from learning more about photography. I began researching cameras like crazy on dpreview, reading about Nikon vs Canon. Canon and Nikon both had just released followups to their first DSLRs the 350D and the D70s. I decided I really wanted the 350D and began saving all the money I could, combined with some money I got from my 21st birthday, I was able to finally purchase my first “real” camera in June of 2005.

I joined the original “Delete Me” group and took part in getting my photos torn apart with no filter on the critiques. I loved all the sub-communities Flickr had created from the groups and would spend hours reading discussions and browsing photos in them. I learned about famous photographers and started to begin to appreciate what made their work special. I fell in love with street photography even though to this day I still can’t get over being comfortable taking a strangers photo.

After a year of uploading photos I was fortunate enough that people started reaching out to me for permission to use some of my photos. Small things at first, a Christmas music album cover and marketing materials . My first “big” break came in February 2010, CBS Sunday Morning news contacted me to use a photo I took of Joe Ades, a street salesmen who sold peelers in Union Square. He had passed away and they were doing a feature on him, I granted them the rights to use my photo in the story and it aired on CBS.

A year later, the High Line park in NYC would contact me to ask for my photo to be used for their annual fundraising gala. It was used as the background for the official invitation and was blown up to cover the walls behind the bars at the event. They granted me two tickets in exchange for the photo. It was an amazing feeling to see a photo I took on display like that. Fast forward a year later, the High Line contacted me again to use my photo in a book that was being written about the High Line, during this time, they also separately were having a contest for a photo be chosen for sale in their store with proceeds of the sale going towards the maintenance of the park. That same photo eventually won the contest and is still available for sale on the High Line today (although my mom and girlfriend just bought up all the copies at the park store during our Mother’s day visit).

After graduating college in 2006 I briefly toyed with the idea of moving out to San Francisco to try and get a job working at Flickr, possibly doing web development. I wasn’t really sure what direction I wanted to take my Computer Science degree. Instead I stayed in NY, and started working as a Systems Administrator. In 2009, I came across a talk from John Allspaw and Paul Hammond, who were then the head of the Operations and Engineering group at Flickr. The talk was at O’Reilly’s Velocity conference and was about how Flickr’s operation’s and engineering teams worked together. I had no idea a community existed of operations folks, nor did I know there were even conferences dedicated to what I did for a living. Watching that talk completely changed my outlook on my job as a system’s administrator. I realized there was so much more I could be doing to make myself better at my job and to help those around me at work. That talk was my first view into the web operations community and from there I started reading about what other people were doing in my field and how they were solving problems. Several years later I got to attend my first Velocity conference out in Santa Clara and it was awesome. So thank you John and Paul for doing your talk. In 2012, Paul did another great talk on “Infrastructure for Startups” , that again, completely resonated with me.

A couple years after John and Paul gave their talk at Velocity, John moved out to NY to take a job at Etsy as the VP of Operations.

While working at Bitly a couple of years ago, we hired Matthew Rothenberg aka “mroth” who was previously the head of product at Flickr. I think he spent the first couple of weeks at Bitly just answering my fan-boy questions about Flickr. He even got my account hooked up with a beta feature at the time. Mroth introduced me to John who was nice enough to grab lunch with me at Etsy and give me some great career advice along with Mike Rembetsy (One of the people responsible for hiring me at my first job out of college).

Flickr gave me a hobby I may not have ever enjoyed as much as I do now and provided a platform that led to others to be able to enjoy my work. The lessons shared by the people working at Flickr made me better at my job and introduced me to a community I didn’t know existed at the time.

## May 01, 2014

### Josef "Jeff" Sipek

For a couple of years now, I wished that I could have a mini-batch system on my computers that’d let me submit jobs and they’d execute when the resources became available. This would let me queue up large amount of work and it’d eventually all get processed. I even tried to hack up a dumb little Python script that’d loop over a file executing no more than one per core.

Then, yesterday, I stumbled across Task Spooler. It’s exactly what I was looking for! It lets me queue jobs, supports dependencies between jobs, etc.

I’m hoping to experiment with it in the next couple of days. I’ll let you know how it turns out.

#### Bugs in Time

Recently, I blahgd about GCC optimizing code interestingly. There, I mentioned a couple of bugs I’ve stumbled across. I’m going to talk more about them in this post.

### ddi_get_time

It all started when I got assigned a bug at work. “The installer hangs while checking available disks.” That’s the extent of the information I was given along with a test system. It didn’t take long to figure that devfsadm -c disk was waiting on a kernel thread that didn’t seem to be making any progress:

swtch+0x141
cv_timedwait_hires+0xec
cv_reltimedwait+0x51
ibdmibdm_ibnex_port_settle_wait+0x5f
ibibnex_bus_config+0x1e8
devi_config_common+0xa5


The function of interest here is ibdm_ibnex_port_settle, but before I talk about it I need to mention that the ibdm kmod stashes a ddi_get_time timestamp of when the HCA attached. Now, ibdm_ibnex_port_settle calls ibdm_get_waittime to get a delay to feed to cv_reltimedwait. The delay is (more or less) calculated as: ddi_get_time() - hca_attach_time. This works fine as long as ddi_get_time continues incrementing at a constant rate (1 sec/sec).

You may already see where this is going. The problem is that ddi_get_time returns a Unix timestamp based on the current time-of-day clock. If the TOD setting changes for whatever reason (daylight saving time adjustments, NTP, etc.), the value returned by ddi_get_time may change non-monotonically. This makes it unsuitable for calculating timeouts and wait times. Converting ibdm_get_waittime to use a monotonic clock source (like gethrtime or ddi_get_lbolt) fixes this bug. (Illumos bug 4777)

Things get a bit worse. While figuring out what ddi_get_time does, I noticed that the man page actively encouraged developers to use it for timeouts. (Illumos bug 4776)

Of course, once I knew about this potential abuse, I had to check that there weren’t similar issues elsewhere in the kernel… and so I got to file bugs for iprb (4778), vhci (4779), COMSTAR iSCSI target (4780), sd (4781), usba (4782), emlxs (4786), ipf (4787), mac (4788), amr (4789), arcmsr (4790), aac (4791), and heci (4792).

I’m fixing all except: amr, arcmsr, aac, and heci.

### NANOSEC

While developing the series of fixes mentioned in the previous section, I ran into the fact that NANOSEC was defined as 1000000000. This made it an int — a 32-bit signed integer (on both ILP32 and LP64).

If NANOSEC (defined this way) is used to convert seconds to nanoseconds (by multiplying), the naive approach will fail with quantities larger than 2 seconds. For example (hrtime_t is a 64-bit signed int):

hrtime_t convert(int secs)
{
return (secs * NANOSEC);
}


Since both secs and NANOSEC are integers, the compiler will compute the product and then sign extend the result to 64-bits. If you look around the Illumos codebase, you’ll see plenty of places that cast or use ULL or LL suffix to make the compiler do the right thing. Why not just change the definition of NANOSEC to include a LL suffix releaving the users of this tedious (and error prone!) duty? Well, now you know what Illumos bug 4809 is about. :)

So, I changed the definition and rebuilt everything. Then, using wsdiff (think: recursive diff that understands how to compare ELF files) I found two places where the before and after binaries differed for non-trivial reasons. (I define a trivial reason as “the compiler decided to use registers differently, but the result is the same”.) Each non-trivial difference implies that there was an expression that changed — it used to be busted!

The first difference was in ZFS (Illumos bug 4810). There, spa_async_tasks_pending miscalculated a timeout making the condition always true.

The second difference was in in.mpathd. 4811). This daemon has a utility function to convert a struct timeval into a hrtime_t. You can read more about it in my previous post.

Before the NANOSEC change, I would have needed casts to fix this. With the change in definition, I don’t have to change a thing! And that’s how a one liner closed three bugs at the same time:

commit b59e2127f21675e88c58a4dd924bc55eeb83c7a6
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Date:   Mon Apr 28 15:53:04 2014 -0400

4809 NANOSEC should be 'long long' to avoid integer overflow bugs
4810 spa_async_tasks_pending suffers from an integer overflow bug
4811 in.mpathd: tv2ns suffers from an integer overflow bug
Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>


## April 25, 2014

### Josef "Jeff" Sipek

#### GCC Optimizations

Recently, I’ve been given a hang bug to work on. This lead me to a another bug related to timing which pushed me to clean up a time related #define which uncovered at least two bugs. Got all that? Good. The rest of this post is going to talk about the changed define, and one of the “at least two bugs”. When I talk about GCC, I’m talking about the Illumos-specific GCC version 4.4.4. (Illumos needs a couple of features that stock GCC doesn’t provide.)

The #define change I’m hoping to make is very simple:

diff --git a/usr/src/uts/common/sys/time.h b/usr/src/uts/common/sys/time.h
--- a/usr/src/uts/common/sys/time.h
+++ b/usr/src/uts/common/sys/time.h
@@ -234,7 +234,7 @@ struct itimerval32 {
#define        SEC             1
#define        MILLISEC        1000
#define        MICROSEC        1000000
-#define        NANOSEC         1000000000
+#define        NANOSEC         1000000000ll

#define        MSEC2NSEC(m)    ((hrtime_t)(m) * (NANOSEC / MILLISEC))
#define        NSEC2MSEC(n)    ((n) / (NANOSEC / MILLISEC))


Without it, multiplying by NANOSEC will cause integer overflow issues on IPL32 and LP64 systems (read: basically everywhere).

One of the “at least two bugs“ involves a simple (buggy) function aptly named tv2ns as it converts a struct timeval to a 64-bit nanosecond count:

static int64_t
tv2ns(struct timeval *tvp)
{
return (tvp->tv_sec * NANOSEC + tvp->tv_usec * 1000);
}


At first glance, this function looks correct. The only flaw with it is that first portion of the expression multiplies a time_t (32-bit signed int) with an int (also 32-bit signed) making the result of that subexpression 32-bit signed expression. With NANOSEC changed to a long long, everything works as expected. Now, the fun part… disassembling this function without the fix. You don’t have to be an expert to see that this function is strangely repetitive. I’ve annotated the assembly.

tv2ns:          movl   0x4(%esp),%eax     ; eax = tvp
tv2ns+4:        movl   0x4(%eax),%edx     ; edx = tvp->tv_usec
tv2ns+7:        leal   (%edx,%edx,4),%edx ; edx = edx + 4 * edx
tv2ns+0xa:      leal   (%edx,%edx,4),%edx ;     = 5 * edx
tv2ns+0xd:      leal   (%edx,%edx,4),%edx
;
; at this point:  edx = 5 * 5 * 5 * tvp->tv_usec,
; which is the same as: 125 * tvp->tv_usec
;
tv2ns+0x10:     movl   (%eax),%eax        ; eax = tvp->tv_sec
tv2ns+0x12:     leal   (%eax,%eax,4),%eax ; eax = eax + 4 * eax
tv2ns+0x15:     leal   (%eax,%eax,4),%eax ;     = 5 * eax
tv2ns+0x18:     leal   (%eax,%eax,4),%eax
tv2ns+0x1b:     leal   (%eax,%eax,4),%eax
tv2ns+0x1e:     leal   (%eax,%eax,4),%eax
tv2ns+0x21:     leal   (%eax,%eax,4),%eax
tv2ns+0x24:     leal   (%eax,%eax,4),%eax
tv2ns+0x27:     leal   (%eax,%eax,4),%eax
tv2ns+0x2a:     leal   (%eax,%eax,4),%eax
;
; at this point,  eax = 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * tvp->tv_sec,
; which is the same as: 1953125 * tvp->tv_sec
;
tv2ns+0x2d:     shll   $0x9,%eax ; eax <<= 9 ; ; eax = (1953125 * tvp->tv_sec) << 9, ; which suprprisingly ends up being the same as: 1000000000 * tvp->tv_sec ; ; so, now we have 'eax' with the tv_sec converted to nanoseconds and 'edx' ; with 125 * tv_usec ; tv2ns+0x30: leal (%eax,%edx,8),%eax ; eax = eax + 8 * edx ; ; 8 * 125 = 1000, which is the factor to convert tv_usec to nanoseconds! ; tv2ns+0x33: cltd ; sign-extend eax to edx:eax tv2ns+0x34: ret  I found it interesting that GCC decided to emit leal instructions to multiply by 5 and then finish it off with a shift and another leal. This is another one of those times when I realize that the compiler is smarter than me. (The sign-extension of course happens too late — all the math needs to happen as 64-bit arithmetic, but that’s not GCC’s fault.) For the record, with the #define changed, the function looks like the following — sorry, no comments on this one: tv2ns: pushl %edi tv2ns+1: pushl %esi tv2ns+2: pushl %ebx tv2ns+3: subl$0x8,%esp
tv2ns+6:        movl   0x18(%esp),%ecx
tv2ns+0xa:      movl   0x4(%ecx),%eax
tv2ns+0xd:      leal   (%eax,%eax,4),%eax
tv2ns+0x10:     leal   (%eax,%eax,4),%eax
tv2ns+0x13:     leal   (%eax,%eax,4),%ebx
tv2ns+0x16:     shll   $0x3,%ebx tv2ns+0x19: movl %ebx,%esi tv2ns+0x1b: sarl$0x1f,%esi
tv2ns+0x1e:     movl   $0x3b9aca00,%edi tv2ns+0x23: movl (%ecx),%eax tv2ns+0x25: imull %edi tv2ns+0x27: movl %eax,(%esp) tv2ns+0x2a: movl %edx,0x4(%esp) tv2ns+0x2e: addl %ebx,%eax tv2ns+0x30: adcl %esi,%edx tv2ns+0x32: addl$0x8,%esp
tv2ns+0x35:     popl   %ebx
tv2ns+0x36:     popl   %esi
tv2ns+0x37:     popl   %edi
tv2ns+0x38:     ret


Maybe one day I’ll rummage through my brain and dig up other times that GCC is outsmarted me and blahg about them. :)

## April 10, 2014

### Justin Dearing

#### The case for open sourcing the SQL Saturday Website

My name is Justin Dearing. I write software for a living. I also write software for free as hobby and for personal development. When I’m not writing code, I speak at user groups, events and conferences about code and code related topics. Once such event is SQL Saturday. I haven’t spoken in a while because I became a dad in June. However, my daughter is 9 months old now and the weather is warm. I feel comfortable attending a regional SQL Saturday or two.

So last night I submitted to SQL Saturday Philadelphia. The submission process (I mean the mechanical process of using the website to submit my abstract) was annoying, as usual. What really got me going though was when I realized two things:

• My newlines were not being preserved so that my asterisks that were supposed to punctuate bullet points were not at the beginnings of lines.
• I could not edit my submission once submitted.

I like bullet points, a lot. However, I digress. In response to my anger, I complained on twitter that the site should be open sourced, so I the end user could create a better experience for myself and my fellow SQL Saturday Speakers.

I got three retweets. At least I wasn’t completely alone in my sentiment. I complained again in the morning, started a conversation and eventually Tim sent this out this:

So the site was being rewritten, but it would not be open sourced.

Should I have been happy at that point, or at least patiently await the changes? One could presume that session editing and submission would be improved. At the very least, things would get progressively better as there were revisions to the code. If the federal government could pull off the ObamaCare site, with some hiccups, why can’t a group of DBAs launch a much smaller website, with much simpler requirements and lower load?

I’d be willing to bet they will. I’d be willing to bet that this site will suck a lot less than the old site, and that it will continue to progress. I’m sure smart people are working on it, and a passionate BoD are guiding the process. At the very least I’ll withhold judgement until the new site is live.

Despite my confidence in the skills of the unknown (to me) parties working on the site, there are so many hours in the day and only so many things a team of finite size can do. However, a sizable minority of PASS’s membership are .NET developers. Many of them speak at SQL Saturdays. They have to submit to the site. Some of them will no doubt be annoyed at some aspect of the site. Some of them might fix that annoyance, or scratch their itch in OSS parlance, if the site was open source and there was a process to accept pull requests.

I’m not describing a hypothetical nirvana. I’ve seen the process I describe work because I’m submitted a lot of patches to a lot of OSS projects. I’ve submitted a patch to the (not actually open source, as Brent will be the first to state) sp_blitz and Brent accepted it. I’ve contributed to NancyFX. I once contributed a small patch to PHP to make it consume WCF services better. I’ve contributed to several other OSS projects as well.

Perhaps your saying SQL Server is a Microsoft product, not some hippie Linux thing. Perhaps you share the same sentiment as Noel McKinney:

However, as I pointed out to Noel, the mothership’s (i.e, Microsoft’s Editors Note: Noel has stated to me he meant Microsoft) beliefs are not anti OSS. Microsoft has fully embraced Open Source. You can become an MVP purely for OSS without any speaking or forum contributions. One of the authors of NancyFX is an example of such a recipient. F#, ASP.NET and Entity Framework are all open source. Just this week Microsoft Open Sourced Roslyn. As a matter of fact I’ve even submitted a patch to the nuget gallery website, which is operated by Microsoft and owned by the OuterCurve foundation. The patch was accepted and my code, along with the code of others was pushed to nuget.org. So I’ve already submitted source code for a website owned and operated by an independent organization  setup by Microsoft, they’ve already accepted it, and the world seems a slightly better place as a result.

So I ask the PASS BoD to consider releasing the SQL Saturday Website source code on github, and I ask the members of PASS to ask their BoD to release the source code as well.

## April 07, 2014

### Josef "Jeff" Sipek

#### Happy 50th, System/360

It’s been a while since I blahged about mainframes. Rest assured, I’m still a huge fan, I’m just preoccupied with other things to continuously extoll their virtues.

The reason I’m writing today is because it is the 50th anniversary of the System/360 announcement. Aside from the “50 years already?” sentiment, I have a couple of images to share. (I found these several years ago on someone’s GeoCities site. It’s a good thing I made a mirror :) )

I also came across this video from 1964:

## March 24, 2014

### Josef "Jeff" Sipek

#### Netflix Chaos Monkey

Somehow, I managed to miss that about two years ago Netflix open sourced their chaos monkey.

Based on my quick look over the code, it appears to be written in Java. Meh. Regardless of the language, it’s great to see large companies open source their code.

## March 02, 2014

### Josef "Jeff" Sipek

#### Comment Spam Filtering Experiments

Just a heads up, I’m getting fed up with all the comment spam that ends up on the moderation queue. So, I’m working on some code to reject comment spam before it hits it. As the title for this post implies, these are experiments; I’ll try my best not to reject any valid comments. I appologize if a valid comment does get rejected.

If you end up being a victim of my overzealous filters, please email me: jeffpc@josefsipek.net.

## February 22, 2014

### Josef "Jeff" Sipek

#### Greetings from Nexenta

In case you missed it, back in mid-2011 I discovered Illumos and OpenIndiana. At that point, I already missed hacking on the (Linux) kernel. Based on my blahg posts [1,2], it shouldn’t surprise you that it didn’t take long before I wanted to hack on the Illumos kernel…and so I did.

If you ever contributed to an open source project in your free time while employed full-time, you understand that there’s only so much time you can devote to the open source project and therefore there is only so much you can do.

A couple of months ago, I decided to explore the possibility of working full-time on Illumos. There are only a handful of companies that visibly participate in the Illumos ecosystem, but their use of Illumos is pretty varied (from public clouds to virtualized databases to SAN/NAS appliances). As of this past Tuesday (Monday was a holiday), I’m at Nexenta. At least for now, I’m working remotely (from Ann Arbor) with the fine folks in the  Lowell office. It feels great to work on open source again.

## February 16, 2014

### Nate Berry

#### Arch Linux on bootable, persistent USB drive

I recently got a new laptop from work. Its a refurbished Dell Latitude E6330 with an Intel Core i5 processor, a 13″ screen and a 120GB SSD drive that came with Windows 7 Pro. I haven’t used Windows regularly in quite some time (I’ve been using a WinXP VM on the rare occassion I need […]

### Justin Dearing

#### Creating a minimally viable CentOS OpenLogic rapache instance

Recently I’ve been dealing with R and rapache at work. R is a language for statisticians. rapache is an apache module for executing R scripts in apache. Its like mod_perl or mod_php for R. I’ve been writing simple RESTful scripts that return graphics and JSON, and calling them from static html pages. I’ve been also using my MSDN Azure subscription to engage in R self study at home. In the spirit of my last post, I’ve posted the setup notes here to get you stated with a new Azure VM for running an rapache instance. Azure used a special cloud enabled version fo CentoS 6.3 called OpenLogic. However, it seems to work similarly to the vanilla CentoOS 6.4 instances I’ve used at work. So everything should apply there. If something doesn’t work leave a comment.

• First, CentOS is very conservative, but Fedora makes EPEL to give you a more modern set of RPMs
• rpm -Uvh http://epel.mirror.freedomvoice.com/6/i386/epel-release-6-8.noarch.rpm
• Now lets install the packages we need. The kernel will be updated, so we will need to reboot.
• yum update -y
• yum install -y vim-x11 vim-enhanced xauth R terminator xterm rxvt R httpd git httpd-devel gcc cairo cairo-devel libXt-devel
• yum groupinstall -y fonts
• ldconfig
• shutdown -r now
• Now as a regular user lets compile rapache.
• mkdir ~/src
• cd ~/src
• git checkout https://github.com/jeffreyhorner/rapache.git
• cd rapache
• ./configure && make && sudo make install
• Now lets configure rapache. Create a file called /etc/httpd/conf.d/rapache.conf with the following:
# rapache configuration by Justin Dearing <zippy1981@gmail.com>
<Location /RApacheInfo>
SetHandler r-info
</Location>
RHandler sys.source
• Now restart apache.  Make sure it’t working by running
elinks http://localhost/RApacheInfo.

Azure doesn’t configure swap space by default. You’re going to absolutely need some swap space if you’re using an extra small instance. A good howto for that is here.

## February 15, 2014

### Justin Lintz

#### Pager Huety

For a hack week project at Chartbeat, I hooked my Philip’s Hue light bulbs into PagerDuty so whenever I get paged my lights will start flashing. Read about the hack over on the PagerDuty blog

#### Lessons learned tuning TCP and Nginx in EC2 at Chartbeat

I wrote a couple blog posts for work diving into what we learned optimizing our performance in EC2.

You can read both parts over at

http://engineering.chartbeat.com/2014/02/12/part-2-lessons-learned-tuning-tcp-and-nginx-in-ec2/

## January 07, 2014

### Josef "Jeff" Sipek

Ever wonder how Google gets its traffic information?

Apparently, there are two sources. The first is the Department of Transportation. The second consists of Android users.

You can always check Google Location History to see what sort of data Google has. (Of course, they may always have more than they show.) Seeing the data can be a bit unnerving. Since I’m not really into giving Google more data than they already have to begin with, and I see no reason for Google to know exactly where I spend my time, I decided to turn this feature off.

### Turning it off

You can find the setting by running the “Google Settings” app. That’s right, not “Settings”. Once there, select “Location”.

As you can see, I want to treat Google apps like any other vendor’s apps. As an added bonus, it looks like my GPS is on way less often.

## January 05, 2014

### Josef "Jeff" Sipek

#### x2APIC, IOMMU, Illumos

About a week ago, I hinted at a boot hang I was debugging. I’ve made some progress with it, and along the way I found some interesting things about which I’ll blog over the next few days. Today, I’m going to talk about the  APIC, xAPIC, and  x2APIC and how they’re handled in Illumos.

### APIC, xAPIC, x2APIC

I strongly suggest you become at least a little familiar with APIC architecture before reading on. The Wikipedia articles above are a good start.

First things first, we need some definitions. APIC can refer to either the architecture or to very old (pre-Pentium 4) implementation. Since I’m working with a Sandy Bridge, I’m going to use APIC to refer to the architecture and completely ignore that these chips existed. Everything they do is a subset of xAPIC. xAPIC is an extension to APIC. xAPIC chips started showed up in NetBurst architecture Intel CPUs (i.e., Pentium 4). xAPIC included some goodies such as upping the limit on the number of CPUs to 256 (from 16). x2APIC is an extension to xAPIC. x2APIC chips started appearing around the same time Sandy Bridge systems started showing up. It is a major update to how interrupts are handled, but as with many things in the PC industry the x2APIC is fully backwards compatible with xAPICs. x2APIC includes some goodies such as upping the limit on the number of CPUs to .

Regardless of which exact flavor you happen to use, you will find two components: the local APIC and I/O APIC. Each processor gets their own local APIC and I/O buses get I/O APICs. I/O APICs can service more than one device, and in fact many systems have only one I/O APIC.

The xAPIC uses  MMIO to program the local and I/O APICs.

x2APIC has two mode of operation. First, there is the xAPIC compatibility mode which makes the x2APIC behave just like an xAPIC. This mode doesn’t give you all the new bells and whistles. Second, there is the new x2APIC mode. In this mode, the APIC is programmed using  MSRs.

One interesting fact about x2APIC is that it requires an  iommu. My Sandy Bridge laptop has an Intel iommu as part of the VT-d feature.

### Illumos /etc/mach

x2APIC in Illumos has two APIC drivers. First, there is pcplusmp which knows how to handle APIC and xAPIC. Second, there is apix which targets x2APIC, but knows how to operate it in both modes. On boot, the kernel consults /etc/mach to get a list of machine specific modules to try to load. Currently, the default contents (trimmed for display here) are:

#
# CAUTION!  The order of modules specified here is very important. If the
# order is not correct it can result in unexpected system behavior. The
# loading of modules is in the reverse order specified here (i.e. the last
#
pcplusmp
apix
xpv_psm


Since I’m not running Xen, xpv_psm will fail to load, and apix gets its chance to load.

### pcplusmp + apix Code Sharing

The code in these two modules can be summarized with a word: mess. Following what happens when would be enough of an adventure. The code for the two modules lives in four directories: usr/src/uts/i86pc/io, usr/src/uts/i86pc/io/psm, usr/src/uts/i86pc/io/pcplusmp, and usr/src/uts/i86pc/io/apix. But the sharing isn’t as straight forward as one would hope.

 Directory pcplusmp apix i86pc/io mp_platform_common.c, mp_platform_misc.c, hpet_acpi.c mp_platform_common.c, hpet_acpi.c i86pc/io/psm psm_common.c psm_common.c i86pc/io/pcplusmp * apic_regops.c, apic_common.c, apic_timer.c i86pc/io/apix — *

This is of course not clear at all when you look at the code. (Reality is a bit messier because of the i86xpv platform which uses some of the i86pc source.)

### apix_probe

When the apix module gets loaded, its probe function (apix_probe) is called. This is the place where the module decides if the hardware is worthy. Specifically, if it finds that the CPU reports x2APIC support via  cpuid, it goes on to call the common APIC probe code (apic_probe_common). Unless that fails, the system will use the apix module — even if there is no iommu and therefore the x2APIC needs to operate in xAPIC mode.

What mode are you using? Easy, just check the apic_mode global in the kernel:

# echo apic_mode::whatis | mdb -k
fffffffffbd0ee4c is apic_mode, in apix's data segment
# echo apic_mode::print | mdb -k
0x2


2 (LOCAL_APIC) indicates xAPIC mode, while 3 (LOCAL_X2APIC) indicates x2APIC mode.

Because this part is as clear as mud, I made a table that tells you what module and mode to expect given your hardware, what CPUID says, and the presence and state of the iommu.

 APIC hw CPUID IOMMU IOMMU state Module apic_mode xAPIC off — — pcplusmp LOCAL_APIC x2APIC off — — pcplusmp LOCAL_APIC x2APIC on absent — apix LOCAL_APIC x2APIC on present off apix LOCAL_APIC x2APIC on present on apix LOCAL_X2APIC

### Defaults

I’ve never seen apic_mode equal to LOCAL_X2APIC in the wild. This was very puzzling. Yesterday, I discovered why. As I mentioned earlier, in order for the x2APIC to operate in x2APIC mode an iommu is required. Long story short, the default config that Illumos ships disables iommus on boot. Specifically:

$cat /platform/i86pc/kernel/drv/rootnex.conf | grep -v '^$$#.*\|$$$'
immu-enable="false";


In order to get LOCAL_X2APIC mode, you need to set:

immu-enable="true";
immu-intrmap-enable="true";


Once you put those into the config file, update you boot archive and reboot. You should be set… except the iommu support in Illumos is… shall we say… poor.

(I should point out that it is possible for the BIOS to enable x2APIC mode before handing control off to the OS. This is pretty rare unless you have a really big x86 system.)

#### 1394

It would seem that the hci1394 driver doesn’t quite know how to deal with an iommu “messing” with it’s I/Os and its interrupt service routine shuts down the driver. (On a debug build it throws is ASSERT(0) for good measure.) I just disabled 1394 in the BIOS since I don’t have any Firewire devices handy and therefore no use for the port at the moment.

#### immu-enable Details

In case you want to know how iommu initialization affects the apix initialization…

During boot, immu_init gets called to initialize iommus. If the config option (immu-enable) is not true, the function just returns instead of calling immu_subsystems_setup which calls immu_intrmap_setup which sets psm_vt_ops to non-NULL value.

Later on, when apix is loaded and is initializing itself in apix_picinit, it calls apic_intrmap_init. This function does nothing if psm_vt_ops are NULL.

### The Hang

I might as well tell you a bit about my progress on tracking down the hang. It happens only if I’m using the apix module and I allow deep C states in the idle thread (technically, it could also be an mwait related issue since I cannot disable just mwait without disabling deep C states). It does not matter if the apic_mode is LOCAL_APIC or LOCAL_X2APIC.

## January 04, 2014

### Josef "Jeff" Sipek

#### Post Preview

One of the blogs I’ve been reading for a few months now just had a post about partial vs. full entries on blog front pages. Since I have some opinions on the subject, I decided to comment. My response turned into something sufficiently content-full that I decided that my blahg would be a better place for it. Sorry, Chris :P

First of all, my blog doesn’t support partial post display because… technical reasons. (The sinking feeling of discovering a design mistake in your code really resonated with me about this exact thing.) With that said, I don’t think that partial display is necessarily bad. I feel like any reasonable (this is of course subjective) blogging software should follow these rules:

1. if we’re displaying a atom/rss feed, display full post
2. if we’re displaying a single post, display full post
3. if the post contains magical marker that denotes where to stop the preview, display everything above the marker
4. display full post

I really dislike when the feeds give me the first sentence and I have to click a link to read more. At the very least, it is inconvenient, and in extreeme cases it feels outright insulting.

I think the post-by-post-basis Chris suggests is the way to go, but in the absence of a user-defined division point I would display the whole thing.

Do I write many posts where I wish I could use this magical marker? No. If that were the case, I’d make supporting this a higher priority. However, there have been a handful of times where I believe that the rest of the post is uninteresting to…well…just about everyone and it is really long. So long, that you might get bored trying to scroll past it. (If you are reading my blahg, I don’t want you to be bored because you had to scroll for too long to skip over an entry — you are my guest, and I am here to entertain you.) This is the time I believe displaying a partial post is good.

I’m hoping that eventually I’ll wrestle with my blogging software sufficiently to eliminate the technical reasons preventing me from introducing and processing this special marker. Not that you’ll really notice anything different. :)

## January 02, 2014

### Josef "Jeff" Sipek

#### Designated Initializers

Designated initializers are a neat feature in C99 that I’ve used for about 6 years. I can’t fathom why anyone would not use them if C99 is available. (Of course if you have to support pre-C99 compilers, you’re very sad.) In case you’ve never seen them, consider this example that’s perfectly valid C99:

int abc[7] = {
[1] = 0xabc,
[2] = 0x12345678,
[3] = 0x12345678,
[4] = 0x12345678,
[5] = 0xdef,
};


As you may have guessed, indices 1–5 will have the specified value. Indices 0 and 6 will be zero. Cool, eh?

### GCC Extensions

Today I learned about a neat GNU extension in GCC to designated initializers. Consider this code snippet:

int abc[7] = {
[1] = 0xabc,
[2 ... 5] = 0x12345678,
[5] = 0xdef,
};


Mind blowing, isn’t it?

Beware, however… GCC’s -std=c99 will not error out if you use ranges! You need to throw in -pedantic to get a warning.

$gcc -c -Wall -std=c99 test.c$ gcc -c -Wall -pedantic -std=c99 test.c
test.c:2:5: warning: ISO C forbids specifying range of elements to initialize [-pedantic]


## December 30, 2013

### Josef "Jeff" Sipek

#### Rebooter

I briefly mentioned that I was debugging a boot hang. Since the hang does not happen every time I try to boot, it may take a couple of reboots to get the kernel to hang. Doing this manually is tedious. Thankfully it can be scripted. Therefore, I made a simple script and a SMF manifest that runs the script at the end of boot. If the system boots fine, my script reboots it. If the system hangs mid-boot, well my script never executes leaving the system in a hung state. Then, I can break into the kernel debugger (mdb) and investigate.

I’m sharing the two here mostly for my benefit… in case one day in the future I decide that I need my system automatically rebooted over and over again.

The script is pretty simple. Hopefully, 60 seconds is long enough to log in and disable the service if necessary. (In reality, I setup a separate boot environment that’s the default choice in Grub. I can just select my normal boot environment and get back to non-timebomb system.)

#!/bin/sh

sleep 60

reboot -p


The tricky part is of course in the manifest. Not because it is hard, but because XML is … verbose.

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='rebooter'>
<service name='site/rebooter' type='service' version='1'>
<dependency name='booted'
grouping='require_all'
restart_on='none'
type='service'>
<service_fmri
value='svc:/milestone/multi-user-server:default'/>
</dependency>

<property_group name="startd" type="framework">
<propval name="duration" type="astring" value="child"/>
<propval name="ignore_error" type="astring"
value="core,signal"/>
</property_group>

<instance name='system' enabled='true'>
<exec_method
type='method'
name='start'
exec='/home/jeffpc/illumos/rebooter/script.sh'
timeout_seconds='0' />

<exec_method
type='method'
name='stop'
exec=':true'
timeout_seconds='0' />
</instance>

<stability value='Unstable' />
</service>
</service_bundle>


That’s all, carry on what you were doing. :)

## December 29, 2013

### Josef "Jeff" Sipek

Recently, I ended up debugging a boot hang. (I’m still working on it, so I don’t have a resolution to it yet.) The hang seems to occur during the mp startup. That is, when the boot CPU tries to online all the other CPUs on the system. As a result, I spent a fair amount of time reading the code and poking around with mdb. Given the effort I put in, I decided to document my understanding of how CPUs get brought online during boot in Illumos. In this post, I’ll talk about the CPU pause threads.

Each CPU has a special thread — the pause thread. It is a very high priority thread that’s supposed to preempt everything on the CPU. If all CPUs are executing this high-priority thread, then we know for fact that nothing can possibly be dereferencing the CPU structures’ (cpu_t) pointers. Why is this useful? Here’s a comment from right above cpu_pause — the function pause threads execute:

/*
* This routine is called to place the CPUs in a safe place so that
* one of them can be taken off line or placed on line.  What we are
* trying to do here is prevent a thread from traversing the list
* of active CPUs while we are changing it or from getting placed on
* the run queue of a CPU that has just gone off line.  We do this by
* creating a thread with the highest possible prio for each CPU and
* having it call this routine.  The advantage of this method is that
* we can eliminate all checks for CPU_ACTIVE in the disp routines.
* This makes disp faster at the expense of making p_online() slower
* which is a good trade off.
*/


The pause thread is pointed to by the CPU structure’s cpu_pause_thread member. A new CPU does not have a pause thread until after it has been added to the list of existing CPUs. (cpu_pause_alloc does the actual allocation.)

CPU pausing is pretty strange. First of all, let’s call the CPU requesting other CPUs to pause the controlling CPU and all online CPUs that will pause the pausing CPUs. (The controlling CPU does not pause itself.) Second, there are two global structures: (1) a global array called safe_list which contains a 8-bit integer for each possible CPU where each element holds a value ranging from 0 to 4 (PAUSE_*) denoting the state of that CPU’s pause thread, and (2) cpu_pause_info which contains some additional goodies used for synchronization.

### Pausing

To pause CPUs, the controlling CPU calls pause_cpus (which uses cpu_pause_start), where it iterates over all the pausing CPUs setting their safe_list entries to PAUSE_IDLE and queueing up (using setbackdq) their pause threads.

Now, just because the pause threads got queued doesn’t mean that they’ll get to execute immediately. That is why the controlling CPU then waits for each of the pause threads to up a semaphore in the cpu_pause_info structure. Once all the pause threads have upped the semaphore, the controlling CPU sets the cp_go flag to let the pause threads know that it’s time for them to go to sleep. Then the controlling CPU waits for each pause thread to signal (via the safe_list) that they have disabled just about all interrupts and that they are spinning (mach_cpu_pause). At this point, pause_cpus knows that all online CPUs are in a safe place.

### Starting

Starting the CPUs back up is pretty easy. The controlling CPU just needs to set all the CPU’s safe_list to a PAUSE_IDLE. That will cause the pausing CPUs to break out of their spin-loop. Once out of the spin loop, interrupts are re-enabled and a CPU control relinquished (via swtch). The controlling CPU does some cleanup of its own, but that’s all that is to it.

### Synchronization

Why not use a mutex or semaphore for everything? The problem lies in the fact that we are in a really fragile state. We don’t want to lose the CPU because we blocked on a semaphore. That’s why this code uses a custom synchronization primitives.

## December 14, 2013

### Josef "Jeff" Sipek

#### iSCSI boot - Success

In my previous post, I documented some steps necessary to get OpenIndiana to boot from iSCSI.

I finally managed to get it to work cleanly. So, here are the remaining details necessary to boot your OI box from iSCSI.

### Installation

First, boot from one of the OI installation media. I used a USB flash drive. Then, before starting the installer, drop into a shell and connect to the target.

# iscsiadm add discovery-address 172.16.0.1
# iscsiadm modify discovery -t enable


At this point, you should have all the LUs accessible:

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c5t600144F000000000000052A4B4CE0002d0 <SUN-COMSTAR-1.0 cyl 13052 alt 2 hd 255 sec 63>
/scsi_vhci/disk@g600144f000000000000052a4b4ce0002
Specify disk (enter its number):


Exit the shell and start the installer.

Now, the tricky part… When you get to the network configuration page, you must select the “None” option. Selecting “Automatically” will cause nwam to try to start on boot and it’ll step onto the already configured network interface. That’s it. Finish installation normally. Once you’re ready to reboot, either configure your network card or use iPXE as I’ve shared before.

### e1000g

For the curious, here’s what the iSCSI booted (from the e1000g NIC) system looks like:

# svcs network/physical
STATE          STIME    FMRI
disabled       17:13:10 svc:/network/physical:nwam
online         17:13:15 svc:/network/physical:default
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g0/?         static   ok           172.16.0.179/24
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128


### nge

Does switching back to the on-board nge NICs work now? No. We still get a lovely panic:

WARNING: Cannot plumb network device 19

Warning - stack not written to the dump buffer
fffffffffbc71ae0 genunix:vfs_mountroot+75 ()
fffffffffbc71b10 genunix:main+136 ()
fffffffffbc71b20 unix:_locore_start+90 ()


## December 08, 2013

### Josef "Jeff" Sipek

#### iSCSI boot

I decided a couple of days ago to try to see if OpenIndiana would still fail to boot from iSCSI like it did about two years ago. This post exists to remind me later what I did. If you find it helpful, great.

First, I got to set up the target. There is a bunch of documentation how to use COMSTAR to export a LU, so I won’t explain. I made a 100 GB LU.

I dug up an older system to act as my test box and disconnected its SATA disk. Booting from the OI USB image was uneventful. Before starting the installer, dropped into a shell and connected to the target (using iscsiadm). Then I installed OI onto the LU. Then, I dropped back into the shell to modify Grub’s menu.lst to use the serial port for both the Grub menu as well as make the kernel direct console output there.

Since the two on-board NICs can’t boot off iSCSI, I ended up using iPXE to boot off iSCSI. First, I made a script file:

#!ipxe

dhcp
sanboot iscsi:172.16.0.1:::0:iqn.2010-08.org.illumos:02:oi-test


Then it was time to grab the source and build it. I did run into a simple problem in a test file, so I patched it trivially.

$git clone git://git.ipxe.org/ipxe.git$ cd ipxe
$cat /tmp/ipxe.patch diff --git a/src/tests/vsprintf_test.c b/src/tests/vsprintf_test.c index 11512ec..2231574 100644 --- a/src/tests/vsprintf_test.c +++ b/src/tests/vsprintf_test.c @@ -66,7 +66,7 @@ static void vsprintf_test_exec ( void ) { /* Basic format specifiers */ snprintf_ok ( 16, "%", "%%" ); snprintf_ok ( 16, "ABC", "%c%c%c", 'A', 'B', 'C' ); - snprintf_ok ( 16, "abc", "%lc%lc%lc", L'a', L'b', L'c' ); + //snprintf_ok ( 16, "abc", "%lc%lc%lc", L'a', L'b', L'c' ); snprintf_ok ( 16, "Hello world", "%s %s", "Hello", "world" ); snprintf_ok ( 16, "Goodbye world", "%ls %s", L"Goodbye", "world" ); snprintf_ok ( 16, "0x1234abcd", "%p", ( ( void * ) 0x1234abcd ) );$ patch -p1 < /tmp/ipxe.patch
$make bin/ipxe.usb EMBED=/tmp/ipxe.script$ sudo dd if=bin/ipxe.usb of=/dev/rdsk/c8t0d0p0 bs=1M


Now, I had a USB flash drive with iPXE that’d get a DHCP lease and then proceed to boot from my iSCSI target.

Did the system boot? Partially. iPXE did everything right — DHCP, storing the iSCSI information in the  iBFT, reading from the LU and handing control over to Grub. Grub did the right thing too. Sadly, once within kernel, things didn’t quite work out the way they should.

### iBFT

Was the iBFT getting parsed properly? After reading the code for a while and using mdb to examine the state, I found a convenient tunable (read: global int that can be set using the debugger) that will cause the iSCSI boot parameters to be dumped to the console. It is called iscsi_print_bootprop. Setting it to non-zero will produce nice output:

Welcome to kmdb
kmdb: unable to determine terminal type: assuming vt100'
Loaded modules: [ unix krtld genunix ]
[0]> iscsi_print_bootprop/W 1
iscsi_print_bootprop:           0               =       0x1
[0]> :c
OpenIndiana Build oi_151a7 64-bit (illumos 13815:61cf2631639d)
SunOS Release 5.11 - Copyright 1983-2010 Oracle and/or its affiliates.
Initiator Name : iqn.2010-04.org.ipxe:00020003-0004-0005-0006-000700080009
Local gateway  : 172.16.0.1
Local DHCP     : 0.0.0.0
Local MAC      : 00:02:b3:a8:66:0c
Target Name    : iqn.2010-08.org.illumos:02:oi-test
Target IP      : 172.16.0.1
Target Port    : 3260
Boot LUN       : 0000-0000-0000-0000


### nge vs. e1000g

So, the iBFT was getting parsed properly. The only “error” message to indicate that something was wrong was the “Cannot plumb network device 19”. Searching the code reveals that this is in the rootconf function. After more tracing, it became apparent that the kernel was trying to set up the NIC but was failing to find a device with the MAC address iBFT indicated. (19 is ENODEV)

At this point, it dawned on me that the on-board NICs are mere nge devices. I popped in a PCI-X e1000g moved the cable over and rebooted. Things got a lot farther!

### unable to connect

Currently, I’m looking at this output.

NOTICE: Configuring iSCSI boot session...
NOTICE: iscsi connection(5) unable to connect to target iqn.2010-08.org.illumos:02:oi-test
Hostname: oi-test
Configuring devices.
NOTICE: iscsi connection(12) unable to connect to target iqn.2010-08.org.illumos:02:oi-test
`

The odd thing is, while these appear SMF is busy loading manifests and tracing the iSCSI traffic to the target shows that the kernel is doing a bunch of reads and writes. I suspect that all the successful I/O was done over one connection and then something happens and we lose the link. This is where I am now.

## December 01, 2013

### Josef "Jeff" Sipek

A couple of months ago, I decided to update my almost two and a half year old laptop. Twice.

First, I got more RAM. This upped it to 12 GB. While still on the low side for a box which actually gets to see some heavy usage (compiling illumos takes a couple of hours and generates a couple of GB of binaries), it was better than the 4 GB I used for way too long.

Second, I decided to bite the bullet and replaced the 320 GB disk with a 256 GB SSD (Samsung 840 Pro). Sadly, in the process I had the pleasure of reinstalling the system — both Windows 7 and OpenIndiana. Overall, the installation was uneventful as my Windows partition has no user data and my OI storage is split into two pools (one for system and one for my data).

The nice thing about reinstalling OI was getting back to a stock OI setup. A while ago, I managed to play with software packaging a bit too much and before I knew it I was using a customized fork of OI that I had no intention of maintaining. Of course, I didn’t realize this until it was too late to rollback. Oops. (Specifically, I had a custom pkg build which was incompatible with all versions OI ever released.)

One of the painful things about my messed-up-OI install was that I was running a debug build of illumos. This made some things pretty slow. One such thing was boot. The ZFS related pieces took about a minute alone to complete. The whole boot procedure took about 2.5 minutes. Currently, with a non-debug build and an SSD, my laptop goes from Grub prompt to gdm login in about 40 seconds. I realize that this is an apples to oranges comparison.

I knew SSDs were supposed to be blazing fast, but I resisted getting one for the longest time mostly due to reliability concerns. What changed my mind? I got to use a couple of SSDs in my workstation at work. I saw the performance and I figured that ZFS would take care of alerting me of any corruption. Since most of my work is version controlled, chances are that I wouldn’t lose anything. Lastly, SSDs got a fair amount of improvements over the past few years.

## November 29, 2013

### Josef "Jeff" Sipek

#### Biometrics

Last week I got to spend a bit of time in NYC with obiwan. He’s never been in New York, so he did the tourist thing. I got to tag along on Friday. We went to the Statue of Liberty, Ellis Island, and a pizza place.

You may have noticed that this post is titled “Biometrics,” so what’s NYC got to do with biometrics? Pretty simple. In order to get into the Statue of Liberty, you have to first surrender your bags to a locker and then you have to go through a metal detector. (This is the second time you go through a metal detector — the first is in Battery Park before you get on the boat to Liberty Island.) Once on Liberty Island, you go into a tent before the entrance where you get to leave your bags and \$2. Among the maybe 500–600 lockers, there are two or three touch screen interfaces. You use these to rent a locker. After selecting the language you wish to communicate in and feeding in the money, a strobe light goes off blinding you — this is to indicate where you are supposed to place your finger to have your finger print scanned. Your desire to rent a locker aside, you want to put your finger on the scanner to make the strobe go away. Anyway, once the system is happy it pops a random (unused) locker open and tells you to use it.

What could possibly go wrong.

After visiting the statue, we got back to the tent to liberate the bags. At the same touch screen interface, we entered in the locker number and when prompted scanned the correct finger. The fingerprint did not get recognized. After repeating the process about a dozen times, it was time to talk to the people running the place about the malfunction. The person asked for the locker number, went to the same interface that we used, used what looked like a  one-wire key fob near the top of the device to get an admin interface and then unlocked the locker. That’s it. No verification of if we actually owned the contents of the locker.

I suppose this is no different from a (physical) key operated locker for which you lost the key. The person in charge of renting the lockers has no way to verify your claim to the contents of the locker. Physical keys, however, are extremely durable compared to the rather finicky fingerprint scanners that won’t recognize you if you look at them the wrong way (or have oily or dirty fingers in a different way than they expect). My guess the reason the park service went with a fingerprint based solution instead of a more traditional physical key based solution is simple: people can’t lose the locker keys if you don’t use them. Now, are cheap fingerprint readers accurate enough to not malfunction like this often? Are the people supervising the locker system generally this apathetic about opening a locker without any questions? I do not know, but my observations so far are not very positive.

I suspect more expensive fingerprint readers will perform better. It just doesn’t make sense for something as cheap as a locker to use the more expensive readers.