# Planet LILUG

## October 06, 2015

### Josef "Jeff" Sipek

#### blahgd fmt 3 Limitations

Recently, I described the four post formats (fmt 0–3) that my blogging software ever supported. This is the promissed follow up.

I am reasonably happy with fmt 3, but I do see some limitations that I’d like to address. This will inevitably lead to fmt 4. fmt 4 will still be LaTeX-like, but it will address the following issues present in fmt 3.

HTML rendering is context in-sensitive: It turns out that there are many instances where blindly converting a character to whatever is necessary to render it properly in HTML is the wrong thing to do. For example, there are many Wikipedia articles that contain an apostrophe in the name. In a recent link dump, I mentioned  Fubini’s theorem. The apostrophe used in the URL must be an ASCII code 0x27 and not a Unicode 0x2019 (aka. &rsquo;). If I forget to do this and type:

\wiki{Fubini's theorem}


The link text will look nice (thanks to &rsquo;), but the link URL will be broken because there is no Wikipedia article called “Fubini’s Theorem”. To work around this, I end up using:

\wiki[Fubini's theorem]{Fubini\%27s theorem}


This will use &rsquo; for the link text and %27 in the link URL. It’s ugly and not user friendly.

The “wiki” command isn’t the only one where special behavior makes sense.

Sadly, the fmt 3 parser just isn’t suited to allow for this. It is the yacc grammar that converts single and double quotes (among other characters) to the appropriate HTML escapes. Eventually, this converted string gets fed to the function that turns the one or two arguments into a link to Wikipedia. At this point, the original raw text is lost. However that is exactly what is needed to link to the proper place!

Metadata is duplicated: Another issue is that it would be nice to keep the metadata of the post with the post text itself. LaTeX-like markup lends itself very nicely to this. For example:

\title{Post Formats in blahgd}
\tag{blahg}
\published{2015-10-05 19:04}


Unfortunately, it’d take some unpleasant hacks to stash these values (unmangled!) in the structure keeping track of the post while a request is processed. So, even for fmt 3, I have to keep metadata in a separate metadata file. (The thoughts and design behind the metadata file could easily fill a post of its own.)

At first, I did not mind this extra copy of the metadata. However, over time it became obvious that this duplication leads to the age-old consistentcy problem. It is tempting to solve this problem by restricting which copy can be modified. Given that being able to edit everything with a text editor is a key goal of blahgd, restricting what can be editted is the wrong solution. Instead, eliminating all but one copy of the metadata is the proper way to solve this.

### Future Plans

To solve these limitations with fmt 3, I am planning for the next format’s parser to do less. Instead of containing the entire translation code in the lex and yacc files, I will have the parser produce a parse tree. Then, the code will transform the parse tree into an AST which will then be transformed into whatever output is necessary (e.g., HTML, Atom, RSS). This will take care of all of the above mentioned issues since the rendering pass will have access at the original text (or equivalent of it). Yes, this sounds a bit heavy-handed but I really think it is the way to go. (For what it’s worth, I am not the only one who thinks that converting markup to HTML should go through an AST.)

Obviously, one of the goals with fmt 4 is to keep as close to fmt 3 as is practical. This will allow a quick and easy migration of the over 400 posts currently using it.

## October 05, 2015

### Josef "Jeff" Sipek

#### Post Formats in blahgd

After reading What creates a good wikitext dialect depends on how it’s going to be used, I decided to write a short post about how I handled the changing needs that my blahg experienced.

One of the things that I implemented in my blogging software a long time ago was the support for different flavors of markup. This allowed me to switch to a “saner” markup without revisiting all the previous entries. Since every post already has metadata (title, publication time, etc.) it was easy enough to add another field (“fmt”) which in my case is just a simple integer identifying how the post contents should be parsed.

Over the years, there have been four formats:

fmt 0 (removed)
Wordpress compat
fmt 1
“Improved” Wordpress format
fmt 2
raw html
fmt 3
LaTeX-like (my current format of choice)

The formats follow a pretty natural progression.

It all started in January 2009 as an import of about 250 posts from Wordpress. Wordpress stores the posts in a html-esque format. At the very least, it inserts line and paragraph breaks. If I remember correctly, one newline is a line break and two newlines are a paragraph break, but otherwise is passes along HTML more or less unchanged. Suffice to say, I was not in the mood to rewrite all the posts in a different format right away so I implemented something that resembled the Wordpress behavior. I did eventually end up converting these posts to a more modern format and then I removed support for this one.

The next format (fmt 1) was a natural evolution of fmt 0. One thing that drove me nuts about fmt 0 was the handling of line breaks. Since the whole point of blahgd was to allow me to compose entries in a text editor (vim if you must know) and I like my text files to be word wrapped, the transformation of every newline to <br/> was quite annoying. (It would result in jagged lines in the browser!) So, about a month later (February 2009), I made a trivial change to the parsing code to treat a single newline as just whitespace. I called this changed up parser fmt 1. (There are currently 24 posts using this format.)

A couple of months after I added fmt 1, I came to the conclusion that in some cases I just wanted to write raw HTML. And so fmt 2 was born (April 2009). (There are currently 5 posts using this format.)

After using fmt 2 for about a year and a half, I concluded that writing raw HTML is not the way to go. Any time I wanted to change the rendering of a certain thing (e.g., switching between <strong> and <b>), I had to revisit every post. Since I am a big fan of LaTeX, I thought it would be cool to have a LaTeX-like markup. It took a couple of false starts spanning about six months but eventually (February 2011) I got the lex and yacc files working well enough. (There are currently 422 posts using this format.)

While I am reasonably happy with fmt 3, but I do see some limitations that I’d like to address. This will inevitably lead to fmt 4. I am hoping to make a separate post about the fmt 3 limitations and how fmt 4 will address them sometime in the (hopefully) near future.

## September 26, 2015

### Josef "Jeff" Sipek

#### GNU inline vs. C99 inline

Recently, I’ve been looking at inline functions in C. However instead of just the usual static inlines, I’ve been looking at all the variants. This used to be a pretty straightforward GNU C extension and then C99 introduced the inline keyword officially. Sadly, for whatever reason decided that the semantics would be just different enough to confuse me and everyone else.

GCC documentation has the following to say:

GCC implements three different semantics of declaring a function inline. One is available with -std=gnu89 or -fgnu89-inline or when gnu_inline attribute is present on all inline declarations, another when -std=c99, -std=c11, -std=gnu99 or -std=gnu11 (without -fgnu89-inline), and the third is used when compiling C++.

Dang! Ok, I don’t really care about C++, so there are only two ways inline can behave.

Before diving into the two different behaviors, there are two cases to consider: the use of an inline function, and the inline function itself. The good news is that the use of an inline function behaves the same in both C90 and C99. Where the behavior changes is how the compiler deals with the inline function itself.

After reading the GCC documentation and skimming the C99 standard, I have put it all into the following table. It lists the different ways of using the inline keyword and for each use whether or not a symbol is produced in C90 (with inline extension) and in C99.

 Emit (C90) Emit (C99) inline always never static inline maybe maybe extern inline never always

(“always” means that a global symbol is always produced regardless of if all the uses of it are inlined. “maybe” means that a local symbol will be produced if and only if some uses cannot be inlined. “never” means that no symbols are produced and any non-inlined uses will be dealt with via relocations to an external symbol.)

Note that C99 “switched” the meaning of inline and extern inline. The good news is, static inline is totally unaffected (and generally the most useful).

For whatever reason, I cannot ever remember this difference. My hope is that this post will help me in the future.

### Trying it Out

We can verify this experimentally. We can compile the following C file with -std=gnu89 and -std=gnu99 and compare what symbols the compiler produces:

static inline void si(int x)
{
}

extern inline void ei(int x)
{
}

inline void i(int x)
{
}


And here’s what nm has to say about them:

test-gcc89:
00000000 T i

test-gcc99:
00000000 T ei


This is an extremely simple example where the “never” and “maybe” cases all skip generating a symbol. In a more involved program that has inline functions that use features of C that prevent inlining (e.g., VLAs) we would see either relocations to external symbols or local symbols.

## September 23, 2015

### Josef "Jeff" Sipek

#### 2015-09-23

The Apple ISA — An interesting view of what Apple could aim for instruction set architecture-wise.

Internetová jazyková příručka — A reference book with grammar and dictionary detailing how to conjugate each Czech word.

Java is Magic: the Gathering (or Poker) and Haskell is Go (the game)

An Interview with Brian Kernighan (July 2000)

Booting a Raspberry Pi2, with u-boot and HYP enabled

The SmPL Grammar — Description of the grammar used by Coccinelle.

Netbooting Debian Squeeze — A link I had sitting around for a couple of years when I last set up a NFS-root netbooting Linux system.

Are there any 3 dimensional items wwe can’t print layer by layer — A humorous story about  Fubini’s theorem and its relation to 3D printing.

The Diagnosis of Mistakes in Programmes on the EDSAC — In some ways, debugging hasn’t changed much since 1951.

## September 22, 2015

### Josef "Jeff" Sipek

#### git filter-branch

Recently, I had to rewrite some commits in a git repository. All I wanted to do was set the author and committer names and emails to the correct value for all the commits in a repository. (Have you ever accidentally committed with user@some.host.local as the email address? I have.) It turns out that git has a handy command for that: git filter-branch. Unfortunately, using it is a bit challenging. Here’s what I ended up doing. (In case it isn’t clear, I am documenting what I have done in case I ever need to do it again on another repository.)

The invocation is relatively easy. We want to pass each commit to a script that creates a new commit with the proper name and email. This is done via the –commit-filter argument. Further, we want to rewrite each tag to point to the new commit hash. This is done via the –tag-filter argument. Since we’re not trying to change the contents of the tag, we use cat to simply pass through the tag contents.

$git filter-branch \ --commit-filter '/home/jeffpc/src/poc-clean/process.sh "$@"' \
--tag-name-filter cat \
-- fmt4 load-all master
Rewrite a95e3603e5ec40e6f229e75425f1969f13c17820 (710/710)
Ref 'refs/heads/fmt4' was rewritten
Ref 'refs/heads/master' was rewritten
v3.1 -> v3.1 (993683bf104f42a74a2c58f2a91aee561573f7cc -> 1a1f4ff657abc8e97879f68a5dc4add664980b71)
v3.2 -> v3.2 (090b3ff1a66fa82d7d8fc99976c42c9495d5a32f -> 60fbeb91b689c65217b5ea17e68983d6aebc0239)
v3.3 -> v3.3 (4fb6d3ac2c5b88e69129cefe92d08decb341e1ae -> dd75fbb92353021c2738da2848111b78d1684405)


Caution: git rewrite-branch changes the directory while it does all the work so don’t try to use relative paths to specify the script.

The commit filter script is rather simple:

#!/bin/sh

name="Josef 'Jeff' Sipek"
email="jeffpc@josefsipek.net"

export GIT_AUTHOR_NAME="$name" export GIT_AUTHOR_EMAIL="$email"
export GIT_COMMITTER_NAME="$name" export GIT_COMMITTER_EMAIL="$email"

exec git commit-tree "$@"  It just the right environmental variables to pass the right name and email to git commit-tree, which writes out the commit object. That’s it! I hope this helps. ## September 03, 2015 ### Josef "Jeff" Sipek #### Dumping Memory in MDB It doesn’t take much reading of documentation, other people’s blogs, and other random web search results to learn how to dump a piece of memory in mdb. In the following examples, I’ll use the address fffffffffbc30a70. This just so happens to be an avl_tree_t on my system. We can use the ::dump command: > fffffffffbc30a70::dump \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc30a70: 801dc3fb ffffffff 0087b1fb ffffffff ................  Or we can use the adb-style /B command: > fffffffffbc30a70/B kas+0x50: 80  We can even specify the amount of data we want to dump. ::dump takes how many bytes to dump, while /B takes how many 1-byte integers to dump (while for example, /X takes how many 4-byte integers to dump): > fffffffffbc30a70,20::dump \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc30a70: 801dc3fb ffffffff 0087b1fb ffffffff ................ fffffffffbc30a80: 20000000 00000000 09000000 00000000 ............... > fffffffffbc30a70,20/B kas+0x50: 80 1d c3 fb ff ff ff ff 0 87 b1 fb ff ff ff ff 20 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0  Things break down if we want to use a walker and pipe the output to ::dump or /B: > fffffffffbc30a70::walk avl | ::dump \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc6d2e0: 00000000 00feffff 0000001e 03000000 ................ > fffffffffbc30a70::walk avl | /B kpmseg: kpmseg: 0 0 0 0 0 0 0 0 0  Even though there are 9 entries in the AVL tree, ::dump dumps only the first one. /B does a bit better and it does print what appears to be the first byte of each. What if we want to dump more than just the first byte? Say, the first 32? ::dump is of no use already. Let’s see what we can make /B do: > fffffffffbc30a70::walk avl | 20/B mdb: syntax error near "20" > fffffffffbc30a70::walk avl | ,20/B mdb: syntax error near ","  No luck. ### Solution Ok, it’s time for the trick that makes it all work. You have to use the ::eval function. For example: > fffffffffbc30a70::walk avl | ::eval .,20::dump \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc6d2e0: 00000000 00feffff 0000001e 03000000 ................ fffffffffbc6d2f0: 00000000 00000000 200ac3fb ffffffff ........ ....... \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc34960: 00000000 00ffffff 00000017 00000000 ................ fffffffffbc34970: 00000000 00000000 200ac3fb ffffffff ........ ....... \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc31ce0: 00000017 00ffffff 00000080 00000000 ................ fffffffffbc31cf0: 00000000 00000000 200ac3fb ffffffff ........ ....... \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc35a80: 00000097 00ffffff 0000a0fc 02000000 ................ fffffffffbc35a90: 00000000 00000000 200ac3fb ffffffff ........ ....... \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc34880: 0000a0d3 03ffffff 00000004 00000000 ................ fffffffffbc34890: 00000000 00000000 200ac3fb ffffffff ........ ....... \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc31d60: 0000a0d7 03ffffff 000060e8 fb000000 ............... fffffffffbc31d70: 00000000 00000000 200ac3fb ffffffff ........ ....... \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc7f3a0: 000000c0 ffffffff 00b07f3b 00000000 ...........;.... fffffffffbc7f3b0: 00000000 00000000 200ac3fb ffffffff ........ ....... \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc7de60: 000080fb ffffffff 00105500 00000000 ..........U..... fffffffffbc7de70: 00000000 00000000 200ac3fb ffffffff ........ ....... \/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef fffffffffbc7e000: 000080ff ffffffff 00004000 00000000 ..........@..... fffffffffbc7e010: 00000000 00000000 200ac3fb ffffffff ........ .......  Perfect! ::eval makes repetition with /B work as well: > fffffffffbc30a70::walk avl | ::eval .,8/B kpmseg: kpmseg: 0 0 0 0 0 fe ff ff kvalloc: kvalloc: 0 0 0 0 0 ff ff ff kpseg: kpseg: 0 0 0 17 0 ff ff ff kzioseg: kzioseg: 0 0 0 97 0 ff ff ff kmapseg: kmapseg: 0 0 a0 d3 3 ff ff ff kvseg: kvseg: 0 0 a0 d7 3 ff ff ff kvseg_core: kvseg_core: 0 0 0 c0 ff ff ff ff ktextseg: ktextseg: 0 0 80 fb ff ff ff ff kdebugseg: kdebugseg: 0 0 80 ff ff ff ff ff  ### /nap There is one more trick I want to share in this post. Suppose you have a mostly useless core file, and you want to dump the stack. Not as hex, but rather as a symbol + offset (if possible). The magic command you want is /nap. ‘/’ for printing, ‘n’ for a newline, ‘a’ for symbol + offset (of the value at “dot”), and ‘p’ for symbol (or address) of “dot”. (Formatting differences aside, ‘p’ prints the pointer—“dot”, and ‘a’ prints the value being pointed to—*“dot”.) For example: > fd94e3a8,8/nap 0xfd94e3a8: 0xfd94e3a8: 0xfd94f5a8 0xfd94e3ac: libzfs.so.1namespace_reload+0x394 0xfd94e3b0: 0xfdd6ce28 0xfd94e3b4: 0xfdd6a423 0xfd94e3b8: 0xcc 0xfd94e3bc: libzfs.so.1__func__.16928 0xfd94e3c0: 0xfdd6ce00 0xfd94e3c4: 0xfdd6ce28  Since the memory happens to be part of the stack, there are no symbols associated with it and therefore the ‘p’ prints a raw hex value. So, remember: if you have a core file and you think that you need to dump the stack to scavenge for hopefully useful values, you want to…nap. :) ## September 01, 2015 ### Josef "Jeff" Sipek #### 2015-09-01 Interchange — This definitely reminds me of xkcd: Highway Engineer Pranks. At the same time, it is fascinating how there is a whole set of standard interchanges. DxOMark — Very in-depth reviews of SLR lenses and bodies. Lisp as the Maxwell’s equations of software — Reading this has rekindled my interest in Lisp and Scheme. This Man Has Been Trying to Live Life as a Goat What’s going on with a Python assignment puzzle — As a C programmer, this is totally counter-intuitive to me. Internet Mainframes Project — Screenshots of 3270 login screens of tons of internet facing mainframes. ## August 10, 2015 ### Josef "Jeff" Sipek #### 2015-08-10 Cap’n Proto — an insanely fast data interchange format. The First Port of Unix International Obfuscated C Code Contest winning entries — list of all the winning entries for all 23 years of IOCCC. How to destroy Programmer Productivity The Open-Office Trap — Having spent a couple of years in an open-office environment, I can tell you first hand that open-office is a bad idea. It was very annoying dealing with one set of light switches for about 70 people. The people near the window wanted them off, while us far away from the window wanted them on. The noise was also annoying — the chatty people (e.g., sales and support) were the worst. They were not malicious, just doing their job or chatting between phone calls. Inside The Secret World of Russia’s Cold War Mapmakers Underwater rugby — It is more like underwater basketball. I find it fascinating that player positions are a 3D location not a 2D location like in “normal” sports. The Memory Sinkhole — an x86 design flaw allowing ring -2 privilege escalation. ## August 03, 2015 ### Justin Lintz #### 30 Days without Social Media On July 3rd, I decided I was taking a break from social media. I deleted Instagram, Twitter, Facebook, Snapchat from my iPhone, iPad and removed any links to them in my browser (I didn’t delete my profiles). A full month seemed like a good amount of time to detox from all of these and see what would change in my life without them. The first couple of days I was certainly fidgety, kept looking down at my phone and not knowing what to do with it. Call people? Nah… I decided I was going to replace a lot of the screen time I had with reading books. Anyone who knows me, knows that I am not a book reader. I haven’t read a fiction book since required reading my senior year of high school, 90% of the books I own are related to computer science. Normally it takes me a couple of months to read through the smallest of books, just reading a few pages here and there, then eventually being distracted. I’ve always struggled with focusing while reading, my mind would drift off very easily to the point where I thought I had ADD (My therapist assures me I don’t) or I would have the constant urge to check my phone and see what’s going on in the world. With my phone basically striped of it’s main purposes for me, I set out and bought my first fiction book, “The Martian” by Andy Weir. I figured it was a good starting point since the movie was coming out and everyone I had talked to said it was an easy read. For the first time in probably almost 20 years I found myself actually being immersed in a book where imagery was being created as I read over each word and the reading of another chapter didn’t feel like a chore. In the last 30 days I’ve read 5 books and finishing up a 6th. 1. The Martian 2. 10% Happier 3. Ready Player One 4. Gone Girl 5. Pines (Wayward Pines #1) 6. The Goldfinch” (still reading) This was easily the most amount of reading I’ve ever done in such a short time span. For the first time in my life , I got to be one of those people who said “The book was better”, after reading “Gone Girl”. It’s been a great detox from being in front of a computer screen all day (these were all paperback books). Could I have done all this without going through the whole social media blackout and a little bit of self discipline? Probably, but with the temptation completely removed it made it a lot more easy to immerse myself in a book and not have any thoughts in the back of my mind. Another big change I noticed in my life was that I was completely disconnected from the news of the world. I hardly ever watch the news on TV and only occasionally check the headlines on cnn.com. I was days late on major news stories, and I’m sure I’m still very out of the loop on others. With twitter we’re so used to finding out things as they happen, often getting a distorted truth of something as it unfolds. I realized how much I depended on it for just knowing what was going on in the world and in my industry. I still got emails for links that were shared at least 5 times in my timeline thanks to my friend Will McCutchen’s Thresholder bot but that was all I heard from twitter in the past 30 days. Without the constant need to check my phone I also found myself being more “in the moment”. Going to a museum and truly enjoying the art instead of worrying about taking a photo of it and sharing it via Instagram or a Snapchat story. Even just going from point A to point B forced me to be more in the moment since I wasn’t reading something on my phone. Being out with friends and not thinking about taking a photo and sharing it with the world but actually spending time with the people around me was refreshing. At work I felt more focused without checking social media in-between tasks. It wasn’t until I removed all the apps did I realize how often I was on them while doing other things. I couldn’t watch a TV show without taking out my phone to check something. Waiting for an elevator and riding in one suddenly felt like a chore without reading through my twitter stream (don’t worry I still just faced forward and refused to acknowledge anyone else in the elevator). If I wanted to know what was going on in my friends lives over the last 30 days I actually had to talk to them instead of seeing a tweet/status/photo. At my 10 year HS reunion a couple of years ago, it was amazing how much I felt like I could just pick up a convo with so many people I hadn’t seen in years, just because I had been keeping up with all their posts on Facebook. Who knows how many engagements and babies I’ve missed out on in the last 30 days. Going forward I’m definitely not going to go back to checking social media the same amount as I did before. Waking up in the morning and reading through all the posts from last night before even getting out of bed. I’ll probably limit myself to checking apps a couple of times a day and not be so concerned with missing out on any posts in my streams. Now I need to finish editing this post and go share it on social media, otherwise, no one would ever see this. ## July 26, 2015 ### Eitan Adler #### Pre-Interview NDAs Are Bad I get quite a few emails from business folk asking me to interview with them or forward their request to other coders I know. Given the volume it isn't feasible to respond affirmatively to all these requests. If you want to get a coder's attention there are a lot of things you could do, but there is one thing you shouldn't do: require them to sign an NDA before you interview them. From the candidates point of view: 1. There are a lot more ideas than qualified candidates. 2. Its unlikely your idea is original. It doesn't mean anyone else is working on it, just that someone else probably thought of it. 3. Lets say the candidate was working on a similar, if not identical project. If the candidate fails to continue with you now they have to consult a lawyer to make sure you can't sue them for a project they were working on before 4. NDAs are hard legal documents and shouldn't be signed without consulting a lawyer. Does the candidate really want to find a lawyer before interviewing with you? 5. An NDA puts the entire obligation on the candidate. What does the candidate get from you? From a company founders point of view: 1. Everyone talks about the companies they interview with to someone. Do you want to be that strange company which made them sign an NDA? It can harm your reputation easily. 2. NDAs do not stop leaks. They serve to create liability when a leak occurs. Do you want to be the company that sues people that interview with them? There are some exceptions; for example government and security jobs may require security clearance and an NDA. For mose jobs it is possible to determine if a coder is qualified and a good fit without disclosing confidential company secrets. ## June 26, 2015 ### Josef "Jeff" Sipek #### 2015-06-26 Who Has Your Back? — An annual report looking at how different major companies react to government requests for data. Learn Lua in 15 Minutes Mega-processor — A project to build a micro-processor using discrete transistors. Stevey’s Google Platforms Rant — A rant by a Googler about Google’s failure to understand platforms (vs. products). ## June 22, 2015 ### Josef "Jeff" Sipek #### Simple File System On three or four occasions over the past 4 years, I had a use for simple file system spec. Either to teach people about file systems, or to have a simple file system to implement to learn the idiosyncracies of an operating system’s VFS layer. This is what I came up with back in 2011 when helping a friend learn about file systems. ### Simple File System The structure is really simple. All multi-byte integers are stored as big endian. A disk is a linear sequence of blocks. Each block is 512 bytes long. You can read/write a block at a time. The first block on the disk is number 0, the second is 1, etc. The following is the file system structure. First of all, the file system uses 1024 byte blocks, and therefore you need to issue two disk I/Os to process a file system block worth of data. The first fs block (disk blocks 0 & 1) is reserved, you should not change it in any way. The second block contains the superblock: struct superblock { uint32_t magic; /* == 0x42420374 */ uint32_t root_inode; /* the disk block containing the root inode */ uint32_t nblocks; /* number of block on the disk */ uint32_t _pad[253]; /* unused (should be '\0' filled) */ };  Starting at the third block is the block allocation map. The most significant bit of the first byte of this block represents fs block 0. The next bit represents block 1, etc. Each file is represented by an inode. The inode contains a number of data block pointers. The first pointer (blocks[0]) contains the first 1024 bytes of the file, blocks[1] the second, etc. The timestamps are in microseconds since 00:00:00 Jan 1, 1900 UTC. sturct inode { uint32_t size; /* file length in bytes */ uint32_t _pad0; /* unused (should be 0) */ uint64_t ctime; /* creation time stamp */ uint64_t mtime; /* last modification time stamp */ uint16_t nblocks; /* number of data blocks in this file */ uint16_t _pad1; /* unused (should be 0) */ uint32_t _pad2; /* unused (should be 0) */ uint32_t blocks[248]; /* file block ptrs */ };  The root directory is represented by an inode, the data pointed to by this inode’s blocks[] have a special format. They should be treated as arrays of directory entries. The filename is space padded (so, “foo.txt” would be stored as “foo.txt ”). struct direntry { char fname[28]; /* the filename */ uint32_t inode; /* the inode block ptr */ };  So, graphically, the file system looks something like: sb -> rootinode |-> direntries | |-> <"foo.txt", inodeptr> | | \-> inode | | |-> data | | |-> data | | \-> data | |-> <"bar.txt", inodeptr> | | \-> inode | | |-> data | | |-> data | | \-> data | | . | | . | | . | \-> <"xyz.txt", inodeptr> | \-> inode | |-> data | |-> data | \-> data | . | . | . \-> direntries |-> <"aaa.txt", inodeptr> | \-> inode | |-> data | |-> data | \-> data |-> <"bbb.txt", inodeptr> | \-> inode | |-> data | |-> data | \-> data | . | . | . \-> <"ccc.txt", inodeptr> \-> inode |-> data |-> data \-> data  ## June 16, 2015 ### Josef "Jeff" Sipek #### Tail Call Optimization I just had an interesting realization about tail call optimization. Often when people talk about it, they simply describe it as an optimization that the compiler does whenever you end a function with a function call whose return value is propagated up as is. Technically this is true. Practically, people use examples like this: int foo(int increment) { if (increment) return bar() + 1; /* NOT a tail-call */ return bar(); /* a tail-call */ }  It sounds like a very solid example of a tail-call vs. not. I.e., if you “post process” the value before returning it is not a tail-call. Going back to my realization, I think people often forget about one type of “post processing” — casts. Consider the following code: extern short bar(); int foo() { return bar(); }  Did you spot it? This is not a tail-call. The integer promotion from short to int is done after bar returns but before foo returns. For fun, here’s the disassembly: $ gcc -Wall -O2 -c test.c
$dis test.o ... foo() foo: 55 pushl %ebp foo+0x1: 89 e5 movl %esp,%ebp foo+0x3: 83 ec 08 subl$0x8,%esp
foo+0x6: e8 fc ff ff ff     call   -0x4	<foo+0x7>
foo+0xb: c9                 leave
foo+0xc: 98                 cwtl
foo+0xd: c3                 ret


For completeness, if we change the return value of bar to int:

$gcc -Wall -O2 -c test.c$ dis test.o
...
foo()
foo:       e9 fc ff ff ff     jmp    -0x4	<foo+0x1>


I wonder how many people think they are tail-call optimizing, but in reality they are “wasting” a stack frame because of this silly reason.

#### Blogging My Way Through CLRS Section 4.1

After another long break of not writing up any CLRS answers here is section 4.1.
Question 4.1-1:

What does $\textit{Find-Maximum-Subarray}$ return when all elements of $A$ are negative?

The procedure would return the single element of maximum value. This is expected since the maximum subarray must contain at least one element. This can be computed by note that $\textit{Find-Max-Crossing-Subarray}$ will always return the array of solely the midpoint and that $\textit{Find-Maximum-Subarray}$ always finds the maxium of $\{leftsum, rightsum, and crosssum\}$

Question 4.1-2:

Write pseudocode for the brute-force method of solving the max-subarray problem. Your solution should run in $\theta(n^2)$ time.

max_i = nilmax_j = nilmax_sum = -∞for i in 0..len(A):   cur_sum = 0   for j in i..len(A):     cur_sum += A[j]     if cur_sum > max_sum:       max_sum = cur_sum       max_i = i       max_j = jreturn (max_i, max_j, max_sum) 

Question 4.1-3:

Implement both the brute-force and recursive algorithms for the maximum-subarray problem on your own computer. What problem size $n_0$ gives the crossover point at which the recursive algorithm beats the brute-force algorithm? Then, change the base case of the recursive algorithm to use the brute-force algorithm whenever the problem size is less than $n_0$. Does that change the crossover point?

This question asks a question that is specific to the implementation, and the computer on which it is run. I will therefore be skipping it in this writeup. It is worthwhile to note that it is almost guarenteed that changing he implementation to use the brute force method for values less than $n_0$ is very likely to change $n_0$.

Question 4.1-4:

Suppose we change the definition of the maximum-subarray problem to allow the result to be an empty subarray, where the sum of the values of an empty subarray is 0. How would you change any of the algorithms that do not allow empty subarrays to permit an empty subarray to be the result?

For the brute force algorithm it would be rather trivial to add a check, and if the return max_sum is > 0 return the empty array.

For the recursive divide and conquer algorithm is is sufficient to just change the $\textit{Find-Max-Crossing-Subarray}$ in a manner similar to the brute force method. If $\textit{Find-Max-Crossing-Subarray}$ return the correct value, then $\textit{Find-Maximum-Subarray}$ will do the correct thing.

Question 4.1-5:

Develop a nonrecursive linear-time algorithm for the maximum-subarray problem.[1]

If one knows a previous answer to the max-subarray problem for a given prefix of the array than any new element consists of only two cases: being part of the maximum subarray or not being part of the maximum subarray. It is easier to explain with pseudocode: max_start = 0max_end = 0max_sum = A[0]max_with_j = A[0]for j in 1..len(A):  # If J is in a maximum-subarray, either j is going to being the maximum on its, or it will will add to the current max  max_with_j = max(A[j], max_with_j + x)  Determine if J is in a maximum-subarray  if max_with_j >= max_sum:    max_sum = max_with_j    max_end = j    #Set the starting value if j is the sole element of a new subarray    if max_with_j == A[j]:      max_start = jreturn (max_start, max_end, cur_max) 

1. The question provides some hints as to the solution of the problem.

## June 15, 2015

### Josef "Jeff" Sipek

#### 2015-06-15

Going under the hood of Inbox — translating Java to ObjC and Javascript

Open-sourcing Facebook Infer: Identify bugs before you ship — a static analysis tool for C, ObjC, and Java

TrainFever — a game inspired by  Transport Tycoon

How Computers Work: A Journey Into the Walk-Through Computer

Why “Agile” and especially Scrum are terrible — a thorough “rant” about agile and scrum and their effects on overall productivity

## June 11, 2015

### Josef "Jeff" Sipek

#### 2015-06-11

O(n) binary search for when O(log n) is just too fast.

A Constructive Look At TempleOS

D3 gallery — chock full of neat visualizations.

The Apollo Guidance Computer: Architecture and Operation seems to be a nice book.

## May 29, 2015

### Josef "Jeff" Sipek

#### 2015-05-29

I’m going to try something new. Instead of sharing individual links per post as I come across them, I’m going to try to dump them whenever I have enough of them. It does mean that some of these links aren’t as “hot off the press”. Here’s the first batch.

How We’re Predicting AI — or Failing To

How Typography Shapes Our Perception Of Truth

Bitcoin mining on a 55 year old IBM 1401 mainframe: 80 seconds per hash

What is the difference between an “aggregate” and other kinds of “modified versions”?

SourceForge grabs GIMP for Windows’ account, wraps installer in bundle-pushing adware

Speed tape looks like duct tape but isn’t.

## May 14, 2015

### Josef "Jeff" Sipek

#### Interactivity During nightly(1)

Every so often, I do a nightly build of illumos on my laptop. This is a long and very CPU intensive process. During the build (which takes about 2.75 hours), the load average is rarely below 8 and most of the time it hovers in the low twenties. (This is a full debug and non-debug build with lint and all the other checking. If I need a build quickly, I can limit it to just what I need and then we’re talking minutes or seconds.)

Anyway, as you might imagine this sort of load puts some pressure on the CPUs. As a result, some interactive processes suffer a bit. Specifically, Firefox doesn’t get enough cycles to render the modern web (read: JavaScript cesspool). Instead of suffering for almost three hours, I just change Firefox’s scheduling class from IA (interactive) to RT (real time):

# priocntl -s -c RT pgrep firefox


This allows Firefox to preempt just about everything on my laptop. This works because Firefox actually yields the CPU properly. This will probably bite me sometime in the future when I end up on a page with such a busted JavaScript turd that it takes over a CPU and won’t let go. Till then, I have a pretty good workaround.

## May 08, 2015

### Nate Berry

#### Still using Arch on my Thinkpad T410s, but not with KDE anymore

Its been a while since I’ve posted anything here so just as an excercise for the fingers I thought I’d post an update about my current machine. I’ve been running Arch on an old Thinkpad T410s for almost a year now ( journalctl says logs started on June 23, 2014). Its an Intel i5 M560 […]

## March 30, 2015

#### FreeBSD SMB Client under OSX Host

I recently purchased a new Macbook Pro and wanted to get a FreeBSD Virtual Machine set up in order to continue doing development work on it. Unfortunately, FreeBSD as a guest does not support native folder sharing so I decided to try using a samba mounted.

I decided to set up my VM to have two network interfaces: a NATed interface for internet access and a host-only interface for access to SMB and ssh.

The NAT networking configuration looks like:

NetworkName:    FreeBSDNatNetworkIP:             10.0.2.1Network:        10.0.2.0/24IPv6 Enabled:   YesIPv6 Prefix:DHCP Enabled:   YesEnabled:        YesPort-forwarding (ipv4)        SSH IPv4:tcp:[]:5022:[10.0.2.4]:22Port-forwarding (ipv6)        FreeBSD ssh:tcp:[]:6022:[fd17:625c:f037:2:a00:27ff:fefc:9dab]:22loopback mappings (ipv4)

The Host-Only networking configuration looks like:

Name:            vboxnet0GUID:            786f6276-656e-4074-8000-0a0027000000DHCP:            DisabledIPAddress:       192.168.56.1NetworkMask:     255.255.255.0IPV6Address:     IPV6NetworkMaskPrefixLength: 0HardwareAddress: 0a:00:27:00:00:00MediumType:      EthernetStatus:          UpVBoxNetworkName: HostInterfaceNetworking-vboxnet0
The FreeBSD configuration looks like this: The OSX sharing configuration looks like:

Unfortunately, when attempting to actually mount the SMB filesystem with: mount_smbfs -I 192.168.56.1 //eax@192.168.56.1/shared_vbox I get the error mount_smbfs: can't get server address: syserr = Operation timed out

I tried installing the package net/samba36 and found that I needed the --signing=off flag to let it work:

It seems based on this setup and research that FreeBSD can not natively mount an OSX samba share. It might be possible to use sysutils/fusefs-smbnetfs. Other people have recommended NFS or sshfs.

## March 01, 2015

### Josef "Jeff" Sipek

#### Concorde

I just came across someone’s blog post full of cool Concorde photos.

It’s a hard choice, but my favorite photo is:

(The black and white photography and the unusual camera position in these images remind me of the Wernher von Braun photo I posted years ago.)

## February 20, 2015

### Josef "Jeff" Sipek

#### Voldemort Collections: Iterating over a key-value store

Aaaand here’s another link that I’ve been sitting on for way too long. I played with Voldemort a bit back in August/September time frame. While playing with it, I found this page which talks about using key-value storage systems for more complex data structures.

#### Practical and Portable x86 Recompilation

About a month ago, I stumbled across this fascinating blog post. I finally got around to sharing it here on my blahg.

## February 04, 2015

### Josef "Jeff" Sipek

#### Raspberry Pi Bootloader

As I mentioned previously, I decided to do some hardware hacking and as a result I ended up with a Raspberry Pi B+. After playing with Raspbian for a bit, I started trying to read up on how the Pi boots. That’s what today’s post is about.

### Standalone UART

While searching for information about how the Pi boots, I stumbled across a git repo with a standalone UART program. Conveniently, the repo includes ELF files as well as raw binary images. (This actually made me go “ewww” at first.)

Before even running it, I looked at the code and saw that it prints out 0x12345678 followed by the PC of one of the first instructions of the program. Pretty minimal, but quite enough.

### Boot Process

Just about everyone on the internet (and their grandmothers!) knows that the Pi has a multi stage boot. First of all, it is the GPU that starts executing first. The ARM core just sits there waiting for a reset.

The Pi requires an SD card to boot — on it must be a FAT formatted partition with a couple of files that you can download from the Raspberry Pi Foundation.

Here’s the short version of what happens:

1. Some sort of baked-in firmware loads bootcode.bin from the SD card and executes it.
2. bootcode.bin does a bit of setup, and then loads and executes start.elf.
3. start.elf does a whole lot of setup.
1. Reads config.txt which is a text file with a bunch of options.
2. Splits the RAM between the GPU and CPU (with the help of fixup.dat).
3. Loads the kernel and the ramdisk.
4. Loads the device tree file or sets up ATAGs.
5. Resets the ARM core. The ARM core then begins executing the kernel.

This sounds pretty nice, right? For the most part, it is. But as they say, the devil’s in the details.

### Booting ELF Files

It doesn’t take a lot of reading to figure out that start.elf can deal with kernel files in ELF format. This made me quite happy since I’ve written an ELF loader before for HVF and while it wasn’t difficult, I didn’t want to go through the exercise just to get something booting.

So, I copied uart02.elf to the SD card, and made a minimal config file:

kernel=uart02.elf


A power-cycle later, I saw the two hex numbers. (Ok, this is a bit of a distortion of reality. It took far too many reboots because I was playing with multiple things at the same time — including U-Boot which turned out to be a total waste of my time.)

The PC was not what I expected it to be. It was 0x8080 instead of 0x800c. After a lot of trial and error, I realized that it just so happened that the .text section is 0x74 bytes into the ELF file. Then I had a horrifying thought… start.elf understands ELF files enough to jump to the right instruction but it does nothing to make the contents of the ELF file properly laid out in memory. It just reads the file into memory starting at address 0x8000, gets the start address from the ELF header, converts it into a file offset and jumps to that location in memory. This is wrong.

Sure enough, booting uart02.bin printed the right number. So much for avoiding writing my own ELF loader.

### Ramdisk

Once I had a reliable way to get code to execute and communicate via the serial port, I started playing with ramdisks. The code I was booting parsed the ATAGs and looked for ATAG_INITRD2 which describes the location of the ramdisk.

So, I extended my config to include the ramfsfile parameter to specify the filename of the ramdisk:

kernel=foo
ramfsfile=initrd


Reboot, aaaand…the code panicked because it couldn’t find ATAG_INITRD2. It was weird, the file was present, no misspellings, etc. After a while of rebooting and trying different things, I decide to use the initramfs option and just pick some arbitrary high address for the ramdisk to be loaded at. The config changed to:

kernel=foo
initramfs initrd 0x800000


Another reboot later, I was looking at a dump of the ATAG_INITRD2. Everything worked as expected! So, it turns out that the boot loader on the Pi is not capable of picking an address for the initial ramdisk by itself.

### Command line

Finally, I just had to experiment with passing a command line string. I created a file called cmdline.txt on the SD card with some text. A reboot later, I saw that the dump of the ATAG_CMDLINE had some cruft prepended. The entire value of the ATAG looked like (with some spaces replaced by line breaks):

dma.dmachans=0x7f35 bcm2708_fb.fbwidth=656 bcm2708_fb.fbheight=416
bcm2708.boardrev=0x10 bcm2708.serial=0xbd225074
bcm2708.disk_led_gpio=47 bcm2708.disk_led_active_low=0
sdhci-bcm2708.emmc_clock_freq=250000000 vc_mem.mem_base=0x1ec00000
vc_mem.mem_size=0x20000000  foo


This isn’t exactly the worst thing, but it does mean that the option parsing code has to handle cruft prepended to what it would expect the command line to look. I wish that these settings were passed via a separate ATAG.

## February 01, 2015

### Josef "Jeff" Sipek

#### Raspberry Pi Serial

Just so I don’t forget, my PL2303HX based TTL serial cable is supposed to be connected to the Raspberry Pi B+ in the following way:

 Wire Color Wire GPIO header Red 5V DC not connected Black GND Pin 6 White RxD Pin 8 (TxD) Green TxD Pin 10 (RxD)

Note that even though the power wire (red) is 5V, the signaling wires (white & green) are 3.3V TTL.

On Illumos, the Prolific chip uses the usbsprl driver.

## January 31, 2015

### Josef "Jeff" Sipek

#### Raspberry Pi

Two weeks ago, I decided to do some hardware hacking. After a bit of reading up on embedded boards, I ended up buying a Raspberry Pi B+. It’s essentially a slightly smaller form factor version of the B, that has more GPIO pins and uses microSD cards instead of SD cards.

I hooked it up to the TV and played with Raspbian and RiscOS a little bit. As you may have guessed by now, that was not enough fun for me. I just had to boot a custom OS that talked over serial. :) This of course required some way to connect the Pi to something that can talk serial. But that’s a post for another day. :P This post is going to be about my impression of the Pi, as well as a cute little use I found for it over the past week.

### Impressions

The Pi is a rather small board. The B+ is even smaller. A lot has been written about the technical side, so I won’t bother.

I was rather impressed with how much punch this little board packs. The hardest part about getting it going was putting it in the case (I got one of those kits because it was cheaper than buying everything separately). The built-in 4-port USB hub ended up quite useful. It allowed me to plug in both a keyboard and a mouse and have NOOBS installing Raspbian and RiscOS within minutes. A quick reboot later, I was at a shell prompt. That’s where the “new toy high” wore off a little. (I know I’ve talked about this with people before — it’s cool to be portable, but it’s also boring since the architecture becomes irrelevant.) I had a shell, and the most creative thing I could think of was to look at /proc/cpuinfo and /proc/meminfo.

I do have some thoughts about where the Pi B+ could have been better. The B version used an SD card. The B+ uses a microSD card. I consider this a bit of a regression. I have a bunch of older SD cards and an SD card reader that works well with SD cards. Sadly, this card reader (using a microSD adapter) fails to play nice with the SDXC modernization of SD that all microSD cards seem to use. I have the same issue with other microSD cards, so I’m pretty sure it’s the card reader. This makes updating a bit more of a pain.

The other thing I wish the Pi had is a DB9 RS232 connector. I have USB serial dongles that work well, but to talk serial to the Pi one needs to either get a level converter or a TTL serial to USB cable. I ended up getting a cheap USB cable with a fake Prolific chip inside. It works, but I hear Windows users are having a terrible time with evil drivers from Prolific.

### Storm Timelapse

A little over a week after getting the Pi in the mail, we got a large storm heading our way. I got the brilliant idea to set up a webcam in an upstairs window. Previously, this would involve digging up an old computer, setting it up by the window, etc. This time, I reached for the Pi. I connected a webcam to one of the USB ports and a cheap WiFi USB adapter to another. A short config later, Raspbian was on the network even though there’s no network drop in sight.

I didn’t want to abuse the microSD card for storage of images, so I mounted an NFS share from the storage server in the basement. I had to use the nolock option to make the mount happen. I probably could have figured out why the lock manager was not running, but it was a temporary setup so a “quick hack” was all I did.

To capture images from the webcam, I ended up installing fswebcam, a small program that does one thing and does it well. I started up screen, and ran fswebcam with the following config.

device /dev/video0
input 0
loop 5
resolution 800x600
timestamp "%Y-%m-%d %H:%M:%S %Z"
jpeg 95
save /mnt/webcam/%Y%m%d/%H/0_%Y%m%d_%H%M%S.jpg
palette YUYV


Then, downstairs on my laptop, I mounted the same share and watched the files appear every five seconds. I ended up running the webcam for two days.

Here’s a couple of stills from the 27th:

And here’s a couple from the 28th:

I did make a quick timelapse, but I haven’t tried to figure out a reasonable set of codec options to not end up with 300 MB of video. Maybe one day I’ll find a good set of options and upload the video here. Here’s what I used:

ffmpeg -framerate 30 -pattern_type glob -i '20150128/*/0_*.jpg' \
-b:v 5000k -g 300 /tmp/out.mp4


Anyway, that’s it for today. I’ll write again about the Pi in the near future — from an OS developer’s perspective.

## January 28, 2015

### dorgan

#### Brining the back-end to the front-end

The latest trend in web seems to be moving the back-end forward.  JavaScript frameworks like AngularJS and a new Service called Back&  are eliminating the back-end/middleware.  Taking the pieces of code that either interactive with databases or web services and moving them client side.  This of course with the help of templating frameworks such as Mustache and Handlebars to handle page layouts and changing markup.

This movement seems to not only be pushing the evolution of ECMA script but also the tools around building these applications npm, node, grunt, bower.

From my experience though, his trend though did not start within the last 2 or even 3 years, but 8 years ago with the release of Sencha's ExtJS Version 2.0.  ExtJS is a library/framework for building web applications, ExtJS has evolved quite a long way from version 2 to its current version 5 evolving into MVC framework and in its latest incarnation a MVVC framework.  The end result for these client side frameworks are that you work with web services and data (JSON, XML).

Moving the front-end/middleware forward though can be risky, you cant just move all your backend/middleware code to the front-end, there are some security concerns that can play into this.

For example if you never thought to secure your web service because it was only accessible on a private network by your web stack well you don't want to put the security itself into the front-end as it easily discovered by view the source of your javascript files.  This is were OAuth type authentication comes into place.  An authentication request is made to the OAuth server and if successful the response contains a token that is used in your web service requests and valid for a given period of time.

But with any luck this migration to the front-end will continue evolving and reduce the layers needs for creating complex websites, services and or applications.

## January 26, 2015

### Nate Berry

#### Printing to Canon MX860 wirelessly from Arch Linux

I’ve been using Arch more often than the Ubuntu machine lately and since I know that my Canon printer is not well supported under Linux I was putting off trying to get it going under Arch. In Ubuntu I was able to use some .deb packages that were hosted on a European Canon website to […]

## January 19, 2015

### Nate Berry

#### Crontab reference

This is just a snippet of text I got from here which I keep around and paste into my crontab on any new machine I set up. If I forget to do this I invariably end up googling it again… every time. # CRONTAB REF # from http://www.adminschoice.com/crontab-quick-reference # # * * * * * […]

## December 08, 2014

### Josef "Jeff" Sipek

#### Debugging with mdb

Recently, Theo Schlossnagle posted two interesting articles about debugging on Illumos using mdb. They are MDB, CTF, DWARF, and other angelic things, and mdb custom dmods.

## December 06, 2014

### Josef "Jeff" Sipek

#### Inline Assembly & clang

Recently I talked about inline assembly with GCC and clang where I pointed out that LLVM seems to produce rather silly machine code. In a comment, a reader asked if this was LLVM’s IR doing this or if it was the machine code generator being silly. I was going to reply there, but the reply got long enough to deserve its own post.

I’ve dealt with LLVM’s IR for a couple of months during the fall of 2010. It was both interesting and quite painful.

The IR is at the  single static assignment level. It assumes that stack space is cheap and infinite. Since it is a SSA form, it has no notion of registers. The optimization passes transform the IR quite a bit and at the end there is very little (if any!) useless code. In other words, I think it is the machine code generation that is responsible for the unnecessary stack frame push and pop. With that said, it is time to experiment.

Using the same test program as before, of course:

#define _KERNEL
#define _ASM_INLINES
#include <sys/atomic.h>

void test(uint32_t *x)
{
atomic_inc_32(x);
}


### Emitting LLVM IR

Let’s compile it with clang passing in the -emit-llvm option to have it generate test.ll file with the LLVM IR:

$clang -S -emit-llvm -Wall -O2 -m64 test.c  There is a fair amount of “stuff” in the file, but the relevant portions are (line-wrapped by me): ; Function Attrs: nounwind define void @test(i32* %x) #0 { entry: tail call void asm sideeffect "lock; incl$0",
"=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
"no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
"no-infs-fp-math"="false" "no-nans-fp-math"="false"
"stack-protector-buffer-size"="8" "unsafe-fp-math"="false"
"use-soft-float"="false" }


LLVM’s IR happens to be very short and to the point. The function prologue and epilogue are not expressed as part of IR blob that gets passed to the machine code generator. Note the function attribute no-frame-pointer-elim being true (meaning frame pointer elimination will not happen).

Now, let’s add in the -fomit-frame-pointer option.

$clang -S -emit-llvm -Wall -O2 -m64 -fomit-frame-pointer test.c  Now, the relevant IR pieces are: ; Function Attrs: nounwind define void @test(i32* %x) #0 { entry: tail call void asm sideeffect "lock; incl$0",
"=*m,*m,~{dirflag},~{fpsr},~{flags}"(i32* %x, i32* %x) #1, !srcloc !1
ret void
}

attributes #0 = { nounwind uwtable "less-precise-fpmad"="false"
"no-frame-pointer-elim"="false" "no-infs-fp-math"="false"
"no-nans-fp-math"="false" "stack-protector-buffer-size"="8"
"unsafe-fp-math"="false" "use-soft-float"="false" }


The no-frame-pointer-elim attribute changed (from true to false), but the IR of the function itself did not change. (The no-frame-pointer-elim-non-leaf attribute disappeared as well, but it really makes sense since -fomit-frame-pointer is a rather large hammer that just forces frame pointer elimination everywhere and so it doesn’t make sense to differentiate between leaf and non-leaf functions.)

So, to answer Steve’s question, the LLVM IR does not include the function prologue and epilogue. This actually makes a lot of sense given that the IR is architecture independent and the exact details of what the prologue has to do are define by the ABIs.

### IR to Assembly

We can of course use llc to convert the IR into real 64-bit x86 assembly code.

$llc --march=x86-64 test.ll$ gas -o test.o --64 test.s


Here is the disassembly for clang invocation without -fomit-frame-pointer:

test()
test:     55                 pushq  %rbp
test+0x1: 48 89 e5           movq   %rsp,%rbp
test+0x4: f0 ff 07           lock incl (%rdi)
test+0x7: 5d                 popq   %rbp
test+0x8: c3                 ret


And here is the disassembly for clang invocation with -fomit-frame-pointer:

test()
test:     f0 ff 07           lock incl (%rdi)
test+0x3: c3                 ret


### Conclusion

So, it turns out that my previous post simply stumbled across the fact that GCC and clang have different set of optimizations for -O2. GCC includes -fomit-frame-pointer by default, while clang does not.

#### Working with Wide Characters

Two weekends ago, I happened to stumble into a situation where I had a use for wide characters. Since I’ve never dealt with them before, it was an interesting experience. I’m hoping to document some of my thoughts and discoveries in this post.

As you may have guessed, I am using OpenIndiana for development so excuse me if I happen to stray from straight up POSIX in favor of Illumos-flavored POSIX.

The program I was working with happens to read a bunch of strings. It then does some mangling on these strings — specifically, it (1) converts these strings between Unicode and EBCDIC, and (2) at times it needs to uppercase a Unicode character. (Yes, technically the Unicode to EBCDIC conversion is lossy since EBCDIC doesn’t have all possible Unicode characters. Practically, the program only cares about a subset of Unicode characters and those all appear in EBCDIC.)

In the past, most of the code I wrote dealt with Unicode by just assuming the world was ASCII. This approach allows UTF-8 to just work in most cases. Assuming you don’t want to mangle the strings in any major way, you’ll be just fine. Concatenation (strcat), ASCII character search (strchr), and substring search (strstr) all work perfectly fine. While other functions will do the wrong thing (e.g., strlen will return number of bytes, not number of characters).

Converting an ASCII string to EBCDIC is pretty easy. For each input character (aka. each input byte), do a lookup in a 256-element array. The output is just a concatenation of all the looked up values.

This simple approach falls apart if the input is UTF-8. There, some characters (e.g., ö) take up multiple bytes (e.g., c3 b6). Iterating over the input bytes won’t work. One way to deal with this is to process as many bytes as necessary to get a full character (1 for ASCII characters, 2–6 for “non-ASCII” Unicode characters), and then covert/uppercase/whatever it instead of the raw bytes. This sort of hoop-jumping is necessary whenever one wants to process characters instead of bytes.

### wchar_t

Another way to deal with this is to store the string as something other than UTF-8. I took this approach. When the program reads in a (UTF-8) string, it promptly converts it into a wide character string. In other words, instead of my strings being char *, they are wchar_t *. On my system, wchar_t is a 32-bit unsigned integer. This trivially makes all Unicode characters the same length — 32 bits. I can go back to assuming that one element of my string corresponds to a single character. I just need to keep in mind that a single character is not one byte. In practice, this means remembering to malloc more memory than before. In other words:

wchar_t *a, *b;

a = malloc(MAX_LEN);                   /* WRONG */
b = malloc(sizeof(wchar_t) * MAX_LEN); /* CORRECT */


Uppercasing a character becomes just as easy as it was with plain ol’ ASCII. For example, to uppercase the letter in a string:

void uppercase_nth(wchar_t *str, int i)
{
str[i] = toupper(str[i]);
}


There are however some downsides. First and foremost, if you are dealing mostly with ASCII, then your memory footprint may have just quadrupled. (In my case, the program is so small that I don’t care about the memory footprint increase.) Second, you have to deal with a couple of “silly” syntax to make the (C99) compiler realize what it is you are attempting to do.

const wchar_t *msg = L"Lorem ipsum";
const wchar_t *letter = L'x';


### “str” functions

Arguably, the most visible change involves the “str” functions. With plain old ASCII strings, you use functions like strlen, strcpy, and strcat to, respectively, get the length, copy a string, and concatenate two strings. These functions assume that each byte is a character and that the string is terminated by a null (8-bit 0) so they do not work in the world of wide characters. (Keep in mind that since ASCII consists of characters with values less than 128, a 32-bit integer with that value will have three null bytes in most characters (assuming ASCII text). On big endian systems, you’ll end up with the empty string, while on little endian systems you’ll end up with a string consisting of just the first character.) Thankfully, there are alternatives to the “str” functions that know how to deal with wide character strings — the “ws” functions. Instead of using strlen, strcpy, and strcat, you want to call wslen, wscpy, and wscat. There are of course more. On Illumos, you can look at the wcstring(3c) manpage for many (but not all!) of them.

### printf & friends

Manipulating strings solely with the “str” functions is tedious. Often enough, it is so much simpler to reach for the venerable printf. This is where things get really interesting. The printf family of functions knows how to convert between char * strings and wchar_t * strings. First of all, let’s take a look at snprintf (the same applies to printf and sprintf). Here’s a simple code snippet that dumps a string into a char array. The output is char *, the format string is char *, and the string input is also char *.

char output[1024];
char *s = "abc";

snprintf(output, sizeof(output), "foo %s bar\n", s);


One can use %ls to let snprintf know that the corresponding input string is a wide character string. snprintf will do everything the same, except it transparently converts the wide character string into a regular string before outputting it. For example:

char output[1024];
wchar_t *s = L"abc";

snprintf(output, sizeof(output), "foo %ls bar\n", s);


Will produce the same output as the previous code snippet.

Now, what if you want the output to be a wide character string? Simple, use the wprintf functions! There are fwprintf, wprintf, and swprintf which correspond to fprintf, printf, and snprintf. Do note that the wide-character versions want the format string to be a wide character string. As far as the format string is concerned, the same rules apply as before — %s for char * input and %ls for wchar_t * input:

wchar_t output[1024];
wchar_t *s1 = L"abc";
char *s2 = "abc";

swprintf(output, sizeof(output), L"foo %ls %s bar\n", s1, s2);


Caution! In addition to swprintf there is also wsprintf. This one takes the format string in char * but outputs into a wchar_t * buffer.

Here’s the same information, in a tabular form. The input string type is always determined by the format string contents — %s for char * input and %ls for wchar_t * input:

 Function Output Format string printf, sprintf, snprintf, fprintf char * char * wprintf, swprintf, fwprintf wchar_t * wchar_t * wsprintf wchar_t * char *

### setlocale and Summary

Oh, I almost forgot! You should call setlocale before you start using all these features.

So, to conclude, it is pretty easy to use wide character strings.

• #include <wchar.h>
• #include <widec.h>
• call setlocale in your main
• use wchar_t instead of char
• use %ls in format strings instead of %s
• use L string literal prefix
• beware of wsprintf and swprintf

I wouldn’t want to deal with this sort of code on daily basis, but for a random side project it isn’t so bad. I do like the ability to not worry about the encoding — the 1:1 mapping of characters to array elements is really convenient.

## December 01, 2014

### Josef "Jeff" Sipek

#### Delegating mount/umount Privileges

Recently, I was doing some file system changes. Obviously, I wanted to run them as an unprivileged user. Unfortunately, the test involved mounting and unmounting a filesystem (tmpfs to be specific). At first I was going to set up a sudo rule to allow mount and umount to run without asking for a password. Then I remembered that I should be able to give the unprivileged user the additional privileges. It turns out that there is only one privilege (sys_mount) necessary to delegate…and it is easy to do!

$usermod -K defaultpriv=basic,sys_mount jeffpc  Then it’s a matter of logging out and back in. We can check using ppriv: $ ppriv 
925:    bash
flags = <none>
E: basic,sys_mount
I: basic,sys_mount
P: basic,sys_mount
L: all


At this point, mounting and unmounting works without sudo or similar user switching:

$mkdir tmp$ mount -F tmpfs none /tmp/tmp
$df -h /tmp/tmp Filesystem Size Used Avail Use% Mounted on swap 2.6G 0 2.6G 0% /tmp/tmp  ## November 27, 2014 ### Josef "Jeff" Sipek #### Inline Assembly & GCC, clang Recently, I got to write a bit of inline assembly. In the process I got to test my changes by making a small C file which defined test function that called the inline function from the header. Then, I could look at the disassembly to verify all was well. #define _KERNEL #define _ASM_INLINES #include <sys/atomic.h> void test(uint32_t *x) { atomic_inc_32(x); }  GCC has been my go to complier for a long time now. So, at first I was using it to debug my inline assembly. I compiled the test programs using: $ gcc -Wall -O2 -m64 -c test.c


Disassembling the object file yields the rather obvious:

test()
test:     f0 ff 07           lock incl (%rdi)
test+0x3: c3                 ret


I can’t think of any way to make it better :)

Then, at some point I remembered that Clang/LLVM are pretty good as well. I compiled the same file with clang:

$clang -Wall -O2 -m64 -c test.c  The result was rather disappointing: test() test: 55 pushq %rbp test+0x1: 48 89 e5 movq %rsp,%rbp test+0x4: f0 ff 07 lock incl (%rdi) test+0x7: 5d popq %rbp test+0x8: c3 ret  For whatever reason, Clang feels the need to push/pop the frame pointer. I did a little bit of searching, and I couldn’t find a way to disable this behavior. The story for 32-bit output is very similar (just drop the -m64 from the compiler invocation). GCC produced the superior output: test() test: 8b 44 24 04 movl 0x4(%esp),%eax test+0x4: f0 ff 00 lock incl (%eax) test+0x7: c3 ret  While Clang still wanted to muck around with the frame pointer. test() test: 55 pushl %ebp test+0x1: 89 e5 movl %esp,%ebp test+0x3: 8b 45 08 movl 0x8(%ebp),%eax test+0x6: f0 ff 00 lock incl (%eax) test+0x9: 5d popl %ebp test+0xa: c3 ret  For the curious ones, I’m using GCC 4.8.3 and Clang 3.4.2. I realize this is a bit of a special case (how often to you make a function that simply calls an inline function?), but it makes me worried about what sort of sub-optimal code Clang produces in other cases. ## November 15, 2014 ### Nate Berry #### Using i3 on my Arch USB flash drive Back in February I wrote about setting up a bootable USB stick with Arch Linux. At the time I was using it with a Dell laptop, but since then have been running it mainly off an old Thinkpad T410s (with a now totally non-functional power cable and a cracked palmrest) that had been retired from […] ## October 03, 2014 ### dorgan #### MyFancyPrincess.com Ruins Halloween!! Steer clear of MyFancyPrincess.com. On July 13, 2014 my mother placed an order for the Anna/Elsa dresses that were on pre-order as she wanted to get the Anna/Elsa dresses for my twin daughters (age 2) so we directed her there as the reviews of the company were great and it pricing was great. We all understood that this was a pre-order item and would not be receiving it right away. Some time went by and I wanted to follow up with www.myfancyprincess.com so I asked my mom to forward me the confirmation email so that I could get the order number and follow up with them. So she did and I sent the following email on 8/25/14 to follow up on the order: I am trying to follow up on the order status for: myfancyprincess-xxxxx I know the order was for pre-order, so I just wanted to follow up on the status for the actual items. On the same day I received the following response: According to the date you ordered, you are in our third presale shipment which is not due to arrive here at our location until the end of August/early September. Once we receive the shipment and check it in, then we will ship in the order received. If there are no further delays you should see a shipping confirmation somewhere around mid to third week of September. Thank you for your business and your continued patience. We sincerely appreciate it! Excellent, a quick reply and an approximate date of when to expect the dresses. So some more time goes by and we hadn't received the dresses yet, so I called my mom and asked her if she had heard anything, and she had not. So on September 23, 2014 I sent a quick email: I just wanted to follow up on this order as we still have not received anything and it is now the end of September. Another 5 days went by without any type of response, so on 9/26/14 I sent another email as their phone system says the way to get the quick response is to email them: I have not received my order not a response from you this week, not quite sure what is going on. On 10/01/14 still no response, so at this point we call and leave a voice message, as well as send another email: I am sending yet another email to follow up on this order. Please its getting close to halloween and it was our understanding that we would have these items by now..... So no response for most of the day on 10/02/14 so I send the following email, granted its definitely confrontational, but all I am looking for is a status update: So yet another week has gone by. Both my wife and I have called and left messages as well as sent email and NOTHING has been responded to. This is totally UNACCEPTABLE and I will start a social media campaign soon if I don't hear back. Also being slightly disgruntled at this point, I figured I would try another contact medium, Facebook. So I posted a message along the same lines as my previous emails. (Which has now been deleted). That seem to get their attention and I received the following email reply. Any details regarding the order are released to the purchaser only. We see we previously responded to an e-mail but that was a mistake. Thank you. To which I responded: Ok please email the purchaser (my mom) with an update and I will contact her to get the details. And they comment on my Facebook feed also that they have responded to my email as well as forwarded the information along to my Mom, great an update, we are happy. My mom then tells me that the email states the dresses will not be shipping for another 1-2 weeks and then we'll receive them 5 days after that. They also explained that this is not their fault and that it is the fault of their manufacturer/distributor. So hey what are you going to do, so we just have to wait. Well it seems that they didn't like some of the negative comments that some of my friends/family put in the thread that I had started with them. So they deleted the post. And sent the following email to my mom, the original purchaser: We have gone ahead and canceled this order. Order delays from our supplier are not our fault and we will not continue to be bashed publicly for something that is not out fault. we just spoke to our supplier yesterday and they are the ones delaying, NOT us. We have explained this just this morning to your husband (I think they meant son) who also tried to publicly shame us for this. We explained it respectfully and nicely. Yet you felt the need to once again publicly bash us for what we already explained was not our fault. We are just as upset over this as you are and have on more then one occasion expressed out disappointment that we are the ones taking al the blame for the delays that are not our fault. We also gave you an option to switch to other in stock dresses (Double the cost) and instead of e-mailing us to work something out you once again went on our page to publicly express your disappointment (My mom, the actual purchaser, never posted on the page, my wife did when she saw they deleted my post to their page). You have every right to be disappointed, but please understand that we did not cause this. You have been refunded in full and the order is now cancelled. Well that got me really pissed as they could have use the opportunity to shine in a customer service issue, and they chose not to. So I looked back at some of their Facebook posts to see if anyone else was complaining on their Facebook page and found a recent one within the last day or two and commented on that posting stating to be careful what they post as if they find it "offensive" or that it is "bashing" them they would cancel your order. Since that comment their Facebook page is now completely locked down, no commenting, no liking and no posting. Ultimately they should have been sending status updates on these pre-order items, thats the right thing to do. They also could have used the Facebook posts to shine in customer service but chose to hide everything in email. I am sorry but if you are on social media you must take the good and the bad with it. You can't just delete/hide everything that you don't like you have to use it as a tool to show everyone else how you can treat the customer with respect. In the end who suffers, my 2 year old twin daughters, as we are really close to Halloween and no one else has these costumes in their size. My wife just told me that she let my daughter Sarah know that she might have to be something else for Halloween and she started to cry. Shame on you www.myfancyprincess.com!! ## September 03, 2014 ### Eitan Adler #### Finding the majority element in a stream of numbers Some time ago I came across the following question. As input a finite stream stream of numbers is provided. Define an algorithm to find the majority element of the input. The algorithm need not provide a sensible result if no majority element exists. You may assume a transdichotomous memory model. There are a few definitions which may not be immediately clear: Stream A possibly infinite set of data which may not be reused in either the forward or backward direction without explicitly storing it. Majority element An element in a set which occurs more than half the time. Transdichotomous The integer size is equal to the word size of memory. One does not need to worry about storing partial pieces of integers in separate memory units. Unfortunately this answer isn't of my own invention, but it is interesting and succinct. The algorithm (click to view)Using 3 registers the accumulator, the guess and the current element (next): 1. Initialize accumulator to 0 2. Accept the next element of the stream and place it into next. If there are no more elements go to step #7. 3. If accumulator is 0 place next into guess and increment accumulator. 4. Else if guess matches next increment accumulator 5. Else decrement accumulator 6. Go to step 2 7. Return the value in guess as the result An interesting property of this algorithm is that it can be implemented in$O(n)$time even on a single tape Turing Machine. ## August 13, 2014 ### Josef "Jeff" Sipek #### Serial Console in a Zone In the past, I’ve talked about serial consoles. I have described how to set up a serial console on Solaris/OpenIndiana. I’ve talked about Grub’s composite console in Illumos-based distros. This time, I’m going do describe the one trick necessary to get tip(1) in a zone working. In my case, I am using SmartOS to run my zones. Sadly, SmartOS doesn’t support device pass-through of this sort, so I have to tweak the zone config after I create the zone with vmadm. Let’s assume that the serial port I want to pass through is /dev/term/a. Passing it through into a zone is as easy as: [root@isis ~]# zonecfg -z 7cff99f6-2b01-464d-9f72-d0ef16ce48af zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> add device zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> set match=/dev/term/a zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af:device> end zonecfg:7cff99f6-2b01-464d-9f72-d0ef16ce48af> commit  At this point, you’ll probably want to reboot the zone (I don’t remember if it is strictly necessary). Once it is back up, you’ll want to get into the zone and point your software of choice at /dev/term/a. It doesn’t matter that you are in a zone. The same configuration rules apply — in my case, it’s the same change to /etc/remote as I described previously. #### Inlining Atomic Operations One of the items on my ever growing TODO list (do these ever shrink?) was to see if inlining Illumos’s atomic_* functions would make any difference. (For the record, these functions atomically manipulate variables. You can read more about them in the various man pages — atomic_add, atomic_and, atomic_bits, atomic_cas, atomic_dec, atomic_inc, atomic_or, atomic_swap.) Of course once I looked at the issue deeply enough, I ended up with five cleanup patches. The gist of it is, inlining them caused not only about 1% kernel performance improvement on the benchmarks, but also reduced the kernel size by a couple of kilobytes. You can read all about it in the associated bugs (5042, 5043, 5044, 5045, 5046, 5047) and the patch 0/6 email I sent to the developer list. In this blahg post, I want to talk about how exactly Illumos presents these atomic functions in a stable ABI but at the same time allows for inlines. ### Genesis It should come as no surprise that the “content” of these functions really needs to be written in assembly. The functions are 100% implemented in assembly in usr/src/common/atomic. There, you will find a directory per architecture. For example, in the amd64 directory, we’ll find the code for a 64-bit atomic increment:  ENTRY(atomic_inc_64) ALTENTRY(atomic_inc_ulong) lock incq (%rdi) ret SET_SIZE(atomic_inc_ulong) SET_SIZE(atomic_inc_64)  The ENTRY, ALTENTRY, and SET_SIZE macros are C preprocessor macros to make writing assembly functions semi-sane. Anyway, this code is used by both the kernel as well as userspace. I am going to ignore the userspace side of the picture and talk about the kernel only. These assembly functions, get mangled by the C preprocessor, and then are fed into the assembler. The object file is then linked into the rest of the kernel. When a module binary references these functions the krtld (linker-loader) wires up those references to this code. ### Inline Replacing these function with inline functions (using the GNU definition) would be fine as far as all the code in Illumos is concerned. However doing so would remove the actual functions (as well as the symbol table entries) and so the linker would not be able to wire up any references from modules. Since Illumos cares about not breaking existing external modules (both open source and closed source), this simple approach would be a no-go. ### Inline v2 Before I go into the next and final approach, I’m going to make a small detour through C land. #### extern inline First off, let’s say that we have a simple function, add, that returns the sum of the two integer arguments, and we keep it in a file called add.c: #include "add.h" int add(int x, int y) { return x + y; }  In the associated header file, add.h, we may include a prototype like the following to let the compiler know that add exists elsewhere and what types to expect. extern int add(int, int);  Then, we attempt to call it from a function in, say, test.c: #include "add.h" int test() { return add(5, 7); }  Now, let’s turn these two .c files into a .so. We get the obvious result — test calls add: test() test: be 07 00 00 00 movl$0x7,%esi
test+0x5: bf 05 00 00 00     movl   $0x5,%edi test+0xa: e9 b1 fe ff ff jmp -0x14f <0xc90>  And the binary contains both functions: $ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[74]	|                3520|                   4|FUNC |GLOB |0    |13   |add
[65]	|                3536|                  15|FUNC |GLOB |0    |13   |test


Now suppose that we modify the header file to include the following (assuming GCC’s inline definition):

extern int add(int, int);

extern inline int add(int a, int b)
{
return a + b;
}


If we compile and link the same .so the same way, that is we feed in the object file with the previously used implementation of add, we’ll get a slightly different binary. The invocation of add will use the inlined version:

test()
test:     b8 0c 00 00 00     movl   $0xc,%eax test+0x5: c3 ret  But the binary will still include the symbol: $ /usr/bin/nm test.so | egrep '(Value|test$|add$)'
[Index]   Value                Size                Type  Bind  Other Shndx Name
[72]	|                3408|                   4|FUNC |GLOB |0    |11   |add
[63]	|                3424|                   6|FUNC |GLOB |0    |11   |test


Neat, eh?

#### extern inline atomic what?

How does this apply to the atomic functions? Pretty simply. As I pointed out, usr/src/common/atomic contains the pure assembly implementations — these are the functions you’ll always find in the symbol table.

The common header file that defines extern prototypes is usr/src/uts/common/sys/atomic.h.

Now, the trick. If you look carefully at the header file, you’ll spot a check on line 39. If all the conditions are true (kernel code, GCC, inline assembly is allowed, and x86), we include asm/atomic.h — which lives at usr/src/uts/intel/asm/atomic.h. This is where the extern inline versions of the atomic functions get defined.

So, kernel code simply includes <sys/atomic.h>, and if the stars align properly, any atomic function use will get inlined.

Phew! This ended up being longer than I expected. :)

## August 12, 2014

### Josef "Jeff" Sipek

#### Grub Composite Console

In the past, I’ve described how to get a serial console going on Illumos based systems. If you ever used a serial console in Grub (regardless of the OS you ended up booting), you probably know that telling Grub to output to a serial port causes the VGA console to become totally useless — it’s blank.

Well, if you are using Illumos, you are in luck. About 5 months ago, Joyent integrated a “composite console” in Grub. You can read the full description in the bug report/feature request. The short version is: all grub output can be sent to both the VGA console as well as over a serial port.

It is very easy to configure. In your menu.lst, change the terminal to composite. For example, this comes from my test box’s config file (omitting the uninteresting bits):

serial --unit=0 --speed=115200
terminal composite


Note the use of composite instead of serial. That’s all there is to it.

## August 06, 2014

### Josef "Jeff" Sipek

#### Operating Systems: Three Easy Pieces

I just found out that Remzi and Andrea decided to write a textbook about operating systems. This is exciting for several reasons. Here are the top two.

First and foremost, the book is free. That’s right, a textbook that is free when every other computer science textbook is easily around $100. Why? I’ll let Remzi make the case. Long story short, publishing a textbook isn’t about making money. It is about sharing ideas. You can download it from the textbook’s website. Second, the book is by Remzi and Andrea. This pair of professors from the University of Wisconsin is responsible for a ton of amazing storage related research. If you don’t believe me, check out their publication track record. I suppose I should mention that I have read only very little of the book, but I did push it onto the top of my to-read stack and I’m slowly making my way through it. I’ll let you all know how it goes. ## August 04, 2014 ### Josef "Jeff" Sipek #### Lua Compatibility Phew! Yesterday afternoon, I decided to upgrade my laptop’s OpenIndiana from 151a9 to “Hipster”. I did it in a bit convoluted way, and hopefully I’ll write about that some other day. In the end, I ended up with a fresh install of the OS with X11 and Gnome. If you’ve ever seen my monitors, you know that I do not use Gnome — I use Notion. So, of course I had it install it. Sadly, OpenIndiana doesn’t ship it so it was up to me to compile it. After the usual fight to get a piece of software to compile on Illumos (a number of the Solaris-isms are still visible), I got it installed. A quick gdm login later, Notion threw me into a minimal environment because something was exploding. After far too many hours of fighting it, searching online, and trying random things, I concluded that it was not Notion’s fault. Rather, it was something on the system. Eventually, I figured it out. Lua 5.2 (which is standard on Hipster) is not compatible with Lua 5.1 (which is standard on 151a9)! Specifically, a number of functions have been removed and the behavior of other functions changed. Not being a Lua expert (I just deal with it whevever I need to change my window manager’s configuration), it took longer than it should but eventually I managed to get Notion working like it should be. So, what sort of incompatibilies did I have to work around? ### loadstring loadstring got renamed to load. This is an easy to fix thing, but still a headache especially if you want to support multiple versions of Lua. ### table.maxn table.maxn got removed. This function returned the largest positive integer key in an associative array (aka. a table) or 0 if there aren’t any. (Lua indexes arrays starting at 1.) The developers decided that it’s so simple that those that want it can write it themselves. Here’s my version: local function table_maxn(t) local mn = 0 for k, v in pairs(t) do if mn < k then mn = k end end return mn end  ### table.insert table.insert now checks bounds. There doesn’t appear to be any specific way to get old behavior. In my case, I was lucky. The index/positition for the insertion was one higher than table_maxn returned. So, I could replace: table.insert(ret, pos, newscreen)  with: ret[pos] = newscreen  ### Final Thougths I can understand wanting to deprecate old crufty interfaces, but I’m not sure that the Lua developers did it right. I really think they should have marked those interfaces as obsolete, make any use spit out a warning, and then in a couple of years remove it. I think that not doing this, will hurt Lua 5.2’s adoption. Yes, I believe there is some sort of a compile time option for Lua to get legacy interfaces, but not everyone wants to recompile Lua because the system installed version wasn’t compiled quite the way that would make things Just Work™. ## August 02, 2014 ### Josef "Jeff" Sipek #### Generating Random Data Over the years, there have been occasions when I needed to generate random data to feed into whatever system. At times, simply using /dev/random or /dev/urandom was sufficient. At other times, I needed to generate random data at a rate that exceeded what I could get out of /dev/random. This morning, I read Chris’s blog entry about his need for generating lots of random data. I decided that I should write my favorite approach so that others can benefit. The approach is very simple. There are two phases. First, we set up our own random pool. Second we use the random pool. I am going to use an example throughout the rest of this post. Suppose that we want to make repeated 128 kB writes to a block device and we want the data to be random so that the device can’t do anything clever (e.g., compress or dedup). Say that during this benchmark we want to write out 64 GB total. (In other words, we will issue 524288 writes.) ### Setup Phase During the setup phase, we create a pool of random data. The easiest way is to just read /dev/urandom. Here, we want to read enough data so that the pool is large enough. For our 128kB write example, we’d want at least 1 MB. (I’ll explain the sizing later. I would probably go with something like 8 MB because unless I’m on some sort of limited system, the extra 7 MB of RAM won’t be missed.) ### “Using the Pool” Phase Now that we have the pool, we can use it to generate random buffers. The basic idea is to pick a random offset into the pool and just grab the bytes starting at that location. In our example, we’d pick a random offset between zero and pool size minus 128 kB, and use the 128 kB at that offset. In pseudo code: #define BUF_SIZE (128 * 1024) #define POOL_SIZE (1024 * 1024) static char pool[POOL_SIZE]; char *generate() { return &pool[rand() % (POOL_SIZE - BUF_SIZE)]; }  That’s it! You can of course make it more general and let the caller tell you how many bytes they want: #define POOL_SIZE (1024 * 1024) static char pool[POOL_SIZE]; char *generate(size_t len) { return &pool[rand() % (POOL_SIZE - len)]; }  It takes a pretty simple argument to show that even a modest sized pool will be able to generate lots of different random buffers. Let’s say we’re dealing with the 128 kB buffer and 1 MB pool case. The pool can return 128 kB starting at offset 0, or offset 1, or offset 2, … or offset 9175043 (). This means that there are 917504 possible outputs. Recall, that in our example we were planning on writing out 64 GB in total which was 524288 writes. In other words, we are planning on using less than 58% of the possible outputs from our 1 MB pool to write out 64 GB of random data! (An 8 MB pool would yield 6.3% usage.) If the length is variable, the math gets more complicated, but in a way we get even better results (i.e., lower usage) because to generate the same buffer we would need have the same offset and length. If the caller supplies random (pseudo-random or based on some distribution) lengths, we’re very unlikely to get the same buffer out of the pool. ### Observations Some of you may have noticed that we traded generating 128 kB (or a user supplied length) of random data for generating a random integer. There are two options there, either you can use a fast pseudo-random number generator (like the Mersenne twister), or you can reuse same pool! In other words: #define POOL_SIZE (1024 * 1024) static char pool[POOL_SIZE]; static size_t *ridx = (size_t *) pool; char *generate(size_t len) { if ((uintptr_t) ridx == (uintptr_t)&pool[POOL_SIZE]) ridx = (size_t *) pool; ridx++; return &pool[ridx % (POOL_SIZE - len)]; }  I leave it as an exercise for the reader to either make it multi-thread safe, or to make the index passed in similarly to how rand_r takes an argument. ### rand_r Considered Harmful Since we’re on the topic of random number generation, I thought I’d mention what is already rather widely known fact — libc’s rand and rand_r implementations are terrible. At one point, I tried using them with this approach to generate random buffers, but it didn’t take very long before I got repeats! Caveat emptor. ## July 23, 2014 ### Josef "Jeff" Sipek #### Segment Drivers Lately, I started poking around the Illumos memory management code. As I’ve done in the past, I decided to use this blahg as a place to document some of my discoveries. ### Memory Layout In Illumos (and Solaris), address spaces are managed as sets of segments. Each segment has a base address, length, and a number of other properties. This is true for both process memory as well as kernel memory. Do not confuse these segments with memory segmentation that processors like x86 provide. Each process has its own struct as: > ::pgrep vim S PID PPID PGID SID UID FLAGS ADDR NAME R 10852 10777 10850 10777 101 0x4a004000 ffffff0411e1c0a0 vim > ffffff0411e1c0a0::print proc_t p_as | ::print struct as a_segtree a_segtree = { a_segtree.avl_root = 0xffffff03f7c62ea8 a_segtree.avl_compar = as_segcompar a_segtree.avl_offset = 0x20 a_segtree.avl_numnodes = 0x18 a_segtree.avl_size = 0x60 }  The kernel address space is maintained in the kas global: > kas::print a_segtree a_segtree = { a_segtree.avl_root = kvseg+0x20 a_segtree.avl_compar = as_segcompar a_segtree.avl_offset = 0x20 a_segtree.avl_numnodes = 0x9 a_segtree.avl_size = 0x60 }  (Once upon a time this set of segments was a linked list, but for a long while now it has been an AVL tree indexed by the base address.) Regardless of which address space we’re dealing with, the same rules apply: segments represent contiguous regions within the address space. Each segment can represent a different type of memory. For example, walking the kernel address space segment tree yields nine different segments of four different types (kpm, kmem, kp, and map): > kas::print a_segtree | ::walk avl | ::printf "%p.%016x %a\n" "struct seg" s_base s_size s_ops fffffe0000000000.000000031e000000 segkpm_ops ffffff0000000000.0000000017000000 segkmem_ops ffffff0017000000.0000000080000000 segkp_ops ffffff0097000000.00000002fca00000 segkmem_ops ffffff03d3a00000.0000000004000000 segmap_ops ffffff03d7a00000.000000fbe8600000 segkmem_ops ffffffffc0000000.000000003b7fb000 segkmem_ops fffffffffb800000.0000000000550000 segkmem_ops ffffffffff800000.0000000000400000 segkmem_ops  ### Segment Drivers Illumos comes with seven different architecture- and platform-independent segment drivers. A segment driver is a “driver” that implements a couple of functions to manage a segment of memory. That is, each segment type can handle page faults, page locking, sync operations, etc. differently. For example, suppose that a page fault occurs because a process tried to load a value from a page that lacks a page table entry. The platform specific (assembly) fault handling code gets invoked by the processor. After doing a little bit of work, it calls into the generic (C) fault handling code, as_fault. There, the segtree AVL tree is consulted and the corresponding segment’s fault operation gets invoked. (Solaris Internals lists 12 and 11 segment drivers, respectively, in the two editions.) In Illumos, the seven common segment drivers are: seg_dev Most of the time, userspace processes do not need to map devices into their address space. In the rare case when a process does want a device mapped (e.g., Xorg), the dev segment driver maintains that mapping. seg_kmem This segment driver maps the kernel heap, module text, and all early boot memory. (code) seg_kp In general, kernel memory is not pageable. In the rare case that something can be in kernel pageable memory, this segment is what maintains the anonymous page mappings. seg_kpm If possible (you’re on a 64-bit system), the kpm segment driver maps all physical memory into the kernel’s address space. This allows the kernel to not have to set up temporary mappings to operate on physical memory. (code) seg_map The map segment driver is a kernel-only higher performance version of the vn segment driver. (See below.) seg_spt This segment driver is responsible for maintaining SysV shared memory segments. (Not to be confused with POSIX shared memory.) seg_vn Memory mapped files are handled by the vn segment driver. This includes both regular files as well as anonymous memory. There are also two platform specific segment drivers: seg_mf (i86xpv only) This segment driver is only used by dom0 processes (read: Xen) to map pages from other domains. seg_nf (sparc v9 only) The header for the file says that it is for non-faulting loads. I don’t actually know what exactly it is for. (And I don’t care enough to dig deeper given that it is Sparc specific.) ### The Reality This is a lot of different segment drivers. Are all of them used all the time? Well, sort of. The mdb output earlier shows that the (amd64) kernel uses only four different segment drivers (kpm, kmem, kp, and map). A typical userspace process is very boring — it is only made up of vn segments. There are, however, exceptions. For instance, Xorg uses vn and dev. This accounts for six of the seven drivers. The last common segment driver is spt, which provides System V shared memory. (I talked about SysV shared memory previously.) So, on a 64-bit x86 system, all seven common segment drivers are in use. The story is a bit different on 32-bit kernels. Since a 32-bit system has much smaller address space, the kernel tries to eliminate a number of mappings. Here is the list of segments in a 32-bit kernel: > kas::print a_segtree | ::walk avl | ::printf "%p %a\n" "struct seg" s_base s_ops b5802000 segmap_ops b6800000 segkmem_ops ef400000 segkmem_ops fe800000 segkmem_ops ff000000 segkmem_ops  As you can see, the kp and kpm segments went away. While at first this is surprising, it actually makes perfect sense. When thinking about memory there are two “types” to consider: physical and virtual. In theory, one can have more virtual than physical thanks to the MMU but in reality this is only true on 64-bit systems. The physical memory sizes have outgrown 4 GB a number of years ago and therefore a 32-bit address space can trivially be 100% backed by physical memory. In other words, 32-bit address spaces are tight on virtual memory, while 64-bit address spaces are “tight” on physical memory. Let’s consider the disappearance of the kp segment on 32-bits. What does kp let us do? It lets us oversubscribe physical memory by backing some virtual memory with disk space. On 32-bit systems we have enough physical memory to back all the virtual memory in the kernel so we don’t need to back some of it by disk. So we have no use for it. (Yes, the kernel still could have paged parts of itself out, but kernel text and data is generally considered important enough to keep it in non-pageable memory. The memory utilization will more than pay for itself by the performance improvement of not having the kernel paged out.) As I stated before, kpm segments map physical memory into the kernel’s address space for performance reasons (without it the kernel would have to temporarily map a page to access the contents). Therefore, they are good candidates for removal when it comes to slimming down the kernel’s address space demands. (Well, the actual story is the other way… the introduction of 64-bit capable hardware allowed kpm segments to exist to improve kernel performance.) ## July 14, 2014 ### Josef "Jeff" Sipek #### Unix Shared Memory While investigating whether some memory management code was still in use (I’ll blahg about this in the future), I ended up learning quite a bit about shared memory on Unix systems. Since I managed to run into a couple of non-obvious snags while trying to get a simple test program running, I thought I’d share my findings here for my future self. All in all, there are three ways to share memory between processes on a modern Unix system. ### System V shm This is the oldest of the three. First you call shmget to set up a shared memory segment and then you call shmat to map it into your address space. Here’s a quick example that does not do any error checking or cleanup: void sysv_shm() { int ret; void *ptr; ret = shmget(0x1234, 4096, IPC_CREAT); printf("shmget returned %d (%d: %s)\n", ret, errno, strerror(errno)); ptr = shmat(ret, NULL, SHM_PAGEABLE | SHM_RND); printf("shmat returned %p (%d: %s)\n", ptr, errno, strerror(errno)); }  What’s so tricky about this? Well, by default Illumos’s shmat will return EPERM unless you are root. This sort of makes sense given how this flavor of shared memory is implemented. (Hint: it’s all in the kernel) ### POSIX shm As is frequently the case, POSIX came up with a different interface and different semantics for shared memory. Here’s the POSIX shm version of the above function: void posix_shm() { int fd; void *ptr; fd = shm_open("/blah", O_RDWR | O_CREAT, 0666); printf("shm_open returned %d (%d: %s)\n", fd, errno, strerror(errno)); ftruncate(fd, 4096); /* IMPORTANT! */ ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno)); }  The very important part here is the ftruncate call. Without it, shm_open may create an empty file and mmaping an empty file won’t work very well. (Well, on Illumos mmap succeeds, but you effectively have a 0-length mapping so any loads or stores will result in a SIGBUS. I haven’t tried other OSes.) Aside from the funny looking path (it must start with a slash, but cannot contain any other slashes), shm_open looks remarkably like the open system call. It turns out that at least on Illumos, shm_open is implemented entirely in libc. The implementation creates a file in /tmp based on the path provided and the file descriptor that it returns is actually a file descriptor for this file in /tmp. For example, “/blah” input translates into “/tmp/.SHMDblah”. (There is a second file “/tmp/.SHMLblah” that doesn’t live very long. I think it is a lock file.) The subsequent mmap call doesn’t have any idea that this file is special in any way. Does this mean that you can reach around shm_open and manipulate the object directly? Not exactly. POSIX states: “It is unspecified whether the name appears in the file system and is visible to other functions that take pathnames as arguments.” The big difference between POSIX and SysV shared memory is how you refer to the segment — SysV uses a numeric key, while POSIX uses a path. ### mmap The last way of sharing memory involves no specialized APIs. It’s just plain ol’ mmap on an open file. For completeness, here’s the function: void mmap_shm() { int fd; void *ptr; fd = open("/tmp/blah", O_RDWR | O_CREAT, 0666); printf("open returned %d (%d: %s)\n", fd, errno, strerror(errno)); ftruncate(fd, 4096); /* IMPORTANT! */ ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); printf("mmap returned %p (%d: %s)\n", ptr, errno, strerror(errno)); }  It is very similar to the POSIX shm code example. As before, we need the ftruncate to make the shared file non-empty. ### pmap In case you’ve wondered what SysV or POSIX shm segments look like on Illumos, here’s the pmap output for a process that basically runs the first two examples above. 6343: ./a.out 0000000000400000 8K r-x-- /storage/home/jeffpc/src/shm/a.out 0000000000411000 4K rw--- /storage/home/jeffpc/src/shm/a.out 0000000000412000 16K rw--- [ heap ] FFFFFD7FFF160000 4K rwxs- [ dism shmid=0x13 ] FFFFFD7FFF170000 4K rw-s- /tmp/.SHMDblah FFFFFD7FFF180000 24K rwx-- [ anon ] FFFFFD7FFF190000 4K rwx-- [ anon ] FFFFFD7FFF1A0000 1596K r-x-- /lib/amd64/libc.so.1 FFFFFD7FFF33F000 52K rw--- /lib/amd64/libc.so.1 FFFFFD7FFF34C000 8K rw--- /lib/amd64/libc.so.1 FFFFFD7FFF350000 4K rwx-- [ anon ] FFFFFD7FFF360000 4K rwx-- [ anon ] FFFFFD7FFF370000 4K rw--- [ anon ] FFFFFD7FFF380000 4K rw--- [ anon ] FFFFFD7FFF390000 4K rwx-- [ anon ] FFFFFD7FFF393000 348K r-x-- /lib/amd64/ld.so.1 FFFFFD7FFF3FA000 12K rwx-- /lib/amd64/ld.so.1 FFFFFD7FFF3FD000 8K rwx-- /lib/amd64/ld.so.1 FFFFFD7FFFDFD000 12K rw--- [ stack ] total 2120K  You can see that the POSIX shm file got mapped in the standard way (address FFFFFD7FFF170000). The SysV shm segment is special — it is not a plain old memory map (address FFFFFD7FFF160000). That’s it for today. I’m going to talk about segment types in the different post in the near future. ## June 06, 2014 ### Josef "Jeff" Sipek #### Moving and Downtime I’ll be moving my server over the next couple of days. I’m working on an email setup to make sure there’s no interruption there. The website and the blahg will however be down until Wednesday evening. Sorry for any inconvenience this may cause. ## May 24, 2014 ### Nate Berry #### Upgrading the Macbook to Ubuntu 14.04 While I certainly have been enjoying running Arch on usb on the Dell, my main machine is still the old MacBook running Ubuntu. I had some extra time today and thought “hey, I should install steam and grab FTL and Kerbal Space Program” but then promptly decided Id rather do upgrades? Im running Gnome3 not […] ## May 12, 2014 ### Justin Lintz #### Ode to Flickr Over the last 9 years, I’ve had a few things in my life that have benefited me both professionally and personally that have all tied back to Flickr. In 2003, Canon had released the first sub-$1000 digital SLR camera, the Canon EOS 300D aka the EOS Digital Rebel. The 300D paved the way for entry level consumers to get involved with the world of digital SLRs, previously the cheapest option was the Canon 10D, available at $2,000. You could now learn photography with full control over your camera and have instant feedback without burning through rolls of film. I joined Flickr in June of 2005. I was checking the site everyday for months prior, viewing the explore page and wondering how people were taking such incredible photos. How were people getting that “blur” in the background of their photos, and why couldn’t I get that effect with my Nikon Coolpix 2500 camera I had at the time? Researching what made the backgrounds “blurry” (which I later learned is called bokeh, and is related to aperture size), I was introduced into the world of DSLRs. Each day I would check Flickr and I was realizing my current camera was not going to cut it for taking the types of photos I wanted. Not being able to control the ISO, shutter speed or aperture on my camera was holding me back from learning more about photography. I began researching cameras like crazy on dpreview, reading about Nikon vs Canon. Canon and Nikon both had just released followups to their first DSLRs the 350D and the D70s. I decided I really wanted the 350D and began saving all the money I could, combined with some money I got from my 21st birthday, I was able to finally purchase my first “real” camera in June of 2005. I joined the original “Delete Me” group and took part in getting my photos torn apart with no filter on the critiques. I loved all the sub-communities Flickr had created from the groups and would spend hours reading discussions and browsing photos in them. I learned about famous photographers and started to begin to appreciate what made their work special. I fell in love with street photography even though to this day I still can’t get over being comfortable taking a strangers photo. After a year of uploading photos I was fortunate enough that people started reaching out to me for permission to use some of my photos. Small things at first, a Christmas music album cover and marketing materials . My first “big” break came in February 2010, CBS Sunday Morning news contacted me to use a photo I took of Joe Ades, a street salesmen who sold peelers in Union Square. He had passed away and they were doing a feature on him, I granted them the rights to use my photo in the story and it aired on CBS. A year later, the High Line park in NYC would contact me to ask for my photo to be used for their annual fundraising gala. It was used as the background for the official invitation and was blown up to cover the walls behind the bars at the event. They granted me two tickets in exchange for the photo. It was an amazing feeling to see a photo I took on display like that. Fast forward a year later, the High Line contacted me again to use my photo in a book that was being written about the High Line, during this time, they also separately were having a contest for a photo be chosen for sale in their store with proceeds of the sale going towards the maintenance of the park. That same photo eventually won the contest and is still available for sale on the High Line today (although my mom and girlfriend just bought up all the copies at the park store during our Mother’s day visit). After graduating college in 2006 I briefly toyed with the idea of moving out to San Francisco to try and get a job working at Flickr, possibly doing web development. I wasn’t really sure what direction I wanted to take my Computer Science degree. Instead I stayed in NY, and started working as a Systems Administrator. In 2009, I came across a talk from John Allspaw and Paul Hammond, who were then the head of the Operations and Engineering group at Flickr. The talk was at O’Reilly’s Velocity conference and was about how Flickr’s operation’s and engineering teams worked together. I had no idea a community existed of operations folks, nor did I know there were even conferences dedicated to what I did for a living. Watching that talk completely changed my outlook on my job as a system’s administrator. I realized there was so much more I could be doing to make myself better at my job and to help those around me at work. That talk was my first view into the web operations community and from there I started reading about what other people were doing in my field and how they were solving problems. Several years later I got to attend my first Velocity conference out in Santa Clara and it was awesome. So thank you John and Paul for doing your talk. In 2012, Paul did another great talk on “Infrastructure for Startups” , that again, completely resonated with me. A couple years after John and Paul gave their talk at Velocity, John moved out to NY to take a job at Etsy as the VP of Operations. While working at Bitly a couple of years ago, we hired Matthew Rothenberg aka “mroth” who was previously the head of product at Flickr. I think he spent the first couple of weeks at Bitly just answering my fan-boy questions about Flickr. He even got my account hooked up with a beta feature at the time. Mroth introduced me to John who was nice enough to grab lunch with me at Etsy and give me some great career advice along with Mike Rembetsy (One of the people responsible for hiring me at my first job out of college). Flickr gave me a hobby I may not have ever enjoyed as much as I do now and provided a platform that led to others to be able to enjoy my work. The lessons shared by the people working at Flickr made me better at my job and introduced me to a community I didn’t know existed at the time. ## May 01, 2014 ### Josef "Jeff" Sipek #### Task Spooler For a couple of years now, I wished that I could have a mini-batch system on my computers that’d let me submit jobs and they’d execute when the resources became available. This would let me queue up large amount of work and it’d eventually all get processed. I even tried to hack up a dumb little Python script that’d loop over a file executing no more than one per core. Then, yesterday, I stumbled across Task Spooler. It’s exactly what I was looking for! It lets me queue jobs, supports dependencies between jobs, etc. I’m hoping to experiment with it in the next couple of days. I’ll let you know how it turns out. #### Bugs in Time Recently, I blahgd about GCC optimizing code interestingly. There, I mentioned a couple of bugs I’ve stumbled across. I’m going to talk more about them in this post. ### ddi_get_time It all started when I got assigned a bug at work. “The installer hangs while checking available disks.” That’s the extent of the information I was given along with a test system. It didn’t take long to figure that devfsadm -c disk was waiting on a kernel thread that didn’t seem to be making any progress: swtch+0x141 cv_timedwait_hires+0xec cv_reltimedwait+0x51 ibdmibdm_ibnex_port_settle_wait+0x5f ibibnex_bus_config+0x1e8 devi_config_common+0xa5 mt_config_thread+0x58 thread_start+8  The function of interest here is ibdm_ibnex_port_settle, but before I talk about it I need to mention that the ibdm kmod stashes a ddi_get_time timestamp of when the HCA attached. Now, ibdm_ibnex_port_settle calls ibdm_get_waittime to get a delay to feed to cv_reltimedwait. The delay is (more or less) calculated as: ddi_get_time() - hca_attach_time. This works fine as long as ddi_get_time continues incrementing at a constant rate (1 sec/sec). You may already see where this is going. The problem is that ddi_get_time returns a Unix timestamp based on the current time-of-day clock. If the TOD setting changes for whatever reason (daylight saving time adjustments, NTP, etc.), the value returned by ddi_get_time may change non-monotonically. This makes it unsuitable for calculating timeouts and wait times. Converting ibdm_get_waittime to use a monotonic clock source (like gethrtime or ddi_get_lbolt) fixes this bug. (Illumos bug 4777) Things get a bit worse. While figuring out what ddi_get_time does, I noticed that the man page actively encouraged developers to use it for timeouts. (Illumos bug 4776) Of course, once I knew about this potential abuse, I had to check that there weren’t similar issues elsewhere in the kernel… and so I got to file bugs for iprb (4778), vhci (4779), COMSTAR iSCSI target (4780), sd (4781), usba (4782), emlxs (4786), ipf (4787), mac (4788), amr (4789), arcmsr (4790), aac (4791), and heci (4792). I’m fixing all except: amr, arcmsr, aac, and heci. ### NANOSEC While developing the series of fixes mentioned in the previous section, I ran into the fact that NANOSEC was defined as 1000000000. This made it an int — a 32-bit signed integer (on both ILP32 and LP64). If NANOSEC (defined this way) is used to convert seconds to nanoseconds (by multiplying), the naive approach will fail with quantities larger than 2 seconds. For example (hrtime_t is a 64-bit signed int): hrtime_t convert(int secs) { return (secs * NANOSEC); }  Since both secs and NANOSEC are integers, the compiler will compute the product and then sign extend the result to 64-bits. If you look around the Illumos codebase, you’ll see plenty of places that cast or use ULL or LL suffix to make the compiler do the right thing. Why not just change the definition of NANOSEC to include a LL suffix releaving the users of this tedious (and error prone!) duty? Well, now you know what Illumos bug 4809 is about. :) So, I changed the definition and rebuilt everything. Then, using wsdiff (think: recursive diff that understands how to compare ELF files) I found two places where the before and after binaries differed for non-trivial reasons. (I define a trivial reason as “the compiler decided to use registers differently, but the result is the same”.) Each non-trivial difference implies that there was an expression that changed — it used to be busted! The first difference was in ZFS (Illumos bug 4810). There, spa_async_tasks_pending miscalculated a timeout making the condition always true. The second difference was in in.mpathd. 4811). This daemon has a utility function to convert a struct timeval into a hrtime_t. You can read more about it in my previous post. Before the NANOSEC change, I would have needed casts to fix this. With the change in definition, I don’t have to change a thing! And that’s how a one liner closed three bugs at the same time: commit b59e2127f21675e88c58a4dd924bc55eeb83c7a6 Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Date: Mon Apr 28 15:53:04 2014 -0400 4809 NANOSEC should be 'long long' to avoid integer overflow bugs 4810 spa_async_tasks_pending suffers from an integer overflow bug 4811 in.mpathd: tv2ns suffers from an integer overflow bug Reviewed by: Marcel Telka <marcel.telka@nexenta.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com>  ## April 25, 2014 ### Josef "Jeff" Sipek #### GCC Optimizations Recently, I’ve been given a hang bug to work on. This lead me to a another bug related to timing which pushed me to clean up a time related #define which uncovered at least two bugs. Got all that? Good. The rest of this post is going to talk about the changed define, and one of the “at least two bugs”. When I talk about GCC, I’m talking about the Illumos-specific GCC version 4.4.4. (Illumos needs a couple of features that stock GCC doesn’t provide.) The #define change I’m hoping to make is very simple: diff --git a/usr/src/uts/common/sys/time.h b/usr/src/uts/common/sys/time.h --- a/usr/src/uts/common/sys/time.h +++ b/usr/src/uts/common/sys/time.h @@ -234,7 +234,7 @@ struct itimerval32 { #define SEC 1 #define MILLISEC 1000 #define MICROSEC 1000000 -#define NANOSEC 1000000000 +#define NANOSEC 1000000000ll #define MSEC2NSEC(m) ((hrtime_t)(m) * (NANOSEC / MILLISEC)) #define NSEC2MSEC(n) ((n) / (NANOSEC / MILLISEC))  Without it, multiplying by NANOSEC will cause integer overflow issues on IPL32 and LP64 systems (read: basically everywhere). One of the “at least two bugs“ involves a simple (buggy) function aptly named tv2ns as it converts a struct timeval to a 64-bit nanosecond count: static int64_t tv2ns(struct timeval *tvp) { return (tvp->tv_sec * NANOSEC + tvp->tv_usec * 1000); }  At first glance, this function looks correct. The only flaw with it is that first portion of the expression multiplies a time_t (32-bit signed int) with an int (also 32-bit signed) making the result of that subexpression 32-bit signed expression. With NANOSEC changed to a long long, everything works as expected. Now, the fun part… disassembling this function without the fix. You don’t have to be an expert to see that this function is strangely repetitive. I’ve annotated the assembly. tv2ns: movl 0x4(%esp),%eax ; eax = tvp tv2ns+4: movl 0x4(%eax),%edx ; edx = tvp->tv_usec tv2ns+7: leal (%edx,%edx,4),%edx ; edx = edx + 4 * edx tv2ns+0xa: leal (%edx,%edx,4),%edx ; = 5 * edx tv2ns+0xd: leal (%edx,%edx,4),%edx ; ; at this point: edx = 5 * 5 * 5 * tvp->tv_usec, ; which is the same as: 125 * tvp->tv_usec ; tv2ns+0x10: movl (%eax),%eax ; eax = tvp->tv_sec tv2ns+0x12: leal (%eax,%eax,4),%eax ; eax = eax + 4 * eax tv2ns+0x15: leal (%eax,%eax,4),%eax ; = 5 * eax tv2ns+0x18: leal (%eax,%eax,4),%eax tv2ns+0x1b: leal (%eax,%eax,4),%eax tv2ns+0x1e: leal (%eax,%eax,4),%eax tv2ns+0x21: leal (%eax,%eax,4),%eax tv2ns+0x24: leal (%eax,%eax,4),%eax tv2ns+0x27: leal (%eax,%eax,4),%eax tv2ns+0x2a: leal (%eax,%eax,4),%eax ; ; at this point, eax = 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * 5 * tvp->tv_sec, ; which is the same as: 1953125 * tvp->tv_sec ; tv2ns+0x2d: shll$0x9,%eax          ; eax <<= 9
;
; eax = (1953125 * tvp->tv_sec) << 9,
; which suprprisingly ends up being the same as: 1000000000 * tvp->tv_sec
;
; so, now we have 'eax' with the tv_sec converted to nanoseconds and 'edx'
; with 125 * tv_usec
;
tv2ns+0x30:     leal   (%eax,%edx,8),%eax ; eax = eax + 8 * edx
;
; 8 * 125 = 1000, which is the factor to convert tv_usec to nanoseconds!
;
tv2ns+0x33:     cltd                      ; sign-extend eax to edx:eax
tv2ns+0x34:     ret


I found it interesting that GCC decided to emit leal instructions to multiply by 5 and then finish it off with a shift and another leal. This is another one of those times when I realize that the compiler is smarter than me. (The sign-extension of course happens too late — all the math needs to happen as 64-bit arithmetic, but that’s not GCC’s fault.)

For the record, with the #define changed, the function looks like the following — sorry, no comments on this one:

tv2ns:          pushl  %edi
tv2ns+1:        pushl  %esi
tv2ns+2:        pushl  %ebx
tv2ns+3:        subl   $0x8,%esp tv2ns+6: movl 0x18(%esp),%ecx tv2ns+0xa: movl 0x4(%ecx),%eax tv2ns+0xd: leal (%eax,%eax,4),%eax tv2ns+0x10: leal (%eax,%eax,4),%eax tv2ns+0x13: leal (%eax,%eax,4),%ebx tv2ns+0x16: shll$0x3,%ebx
tv2ns+0x19:     movl   %ebx,%esi
tv2ns+0x1b:     sarl   $0x1f,%esi tv2ns+0x1e: movl$0x3b9aca00,%edi
tv2ns+0x23:     movl   (%ecx),%eax
tv2ns+0x25:     imull  %edi
tv2ns+0x27:     movl   %eax,(%esp)
tv2ns+0x2a:     movl   %edx,0x4(%esp)
tv2ns+0x35:     popl   %ebx
tv2ns+0x36:     popl   %esi
tv2ns+0x37:     popl   %edi
tv2ns+0x38:     ret


Maybe one day I’ll rummage through my brain and dig up other times that GCC is outsmarted me and blahg about them. :)

## April 10, 2014

### Justin Dearing

#### The case for open sourcing the SQL Saturday Website

My name is Justin Dearing. I write software for a living. I also write software for free as hobby and for personal development. When I’m not writing code, I speak at user groups, events and conferences about code and code related topics. Once such event is SQL Saturday. I haven’t spoken in a while because I became a dad in June. However, my daughter is 9 months old now and the weather is warm. I feel comfortable attending a regional SQL Saturday or two.

So last night I submitted to SQL Saturday Philadelphia. The submission process (I mean the mechanical process of using the website to submit my abstract) was annoying, as usual. What really got me going though was when I realized two things:

• My newlines were not being preserved so that my asterisks that were supposed to punctuate bullet points were not at the beginnings of lines.
• I could not edit my submission once submitted.

I like bullet points, a lot. However, I digress. In response to my anger, I complained on twitter that the site should be open sourced, so I the end user could create a better experience for myself and my fellow SQL Saturday Speakers.

I got three retweets. At least I wasn’t completely alone in my sentiment. I complained again in the morning, started a conversation and eventually Tim sent this out this:

So the site was being rewritten, but it would not be open sourced.

Should I have been happy at that point, or at least patiently await the changes? One could presume that session editing and submission would be improved. At the very least, things would get progressively better as there were revisions to the code. If the federal government could pull off the ObamaCare site, with some hiccups, why can’t a group of DBAs launch a much smaller website, with much simpler requirements and lower load?

I’d be willing to bet they will. I’d be willing to bet that this site will suck a lot less than the old site, and that it will continue to progress. I’m sure smart people are working on it, and a passionate BoD are guiding the process. At the very least I’ll withhold judgement until the new site is live.

Despite my confidence in the skills of the unknown (to me) parties working on the site, there are so many hours in the day and only so many things a team of finite size can do. However, a sizable minority of PASS’s membership are .NET developers. Many of them speak at SQL Saturdays. They have to submit to the site. Some of them will no doubt be annoyed at some aspect of the site. Some of them might fix that annoyance, or scratch their itch in OSS parlance, if the site was open source and there was a process to accept pull requests.

I’m not describing a hypothetical nirvana. I’ve seen the process I describe work because I’m submitted a lot of patches to a lot of OSS projects. I’ve submitted a patch to the (not actually open source, as Brent will be the first to state) sp_blitz and Brent accepted it. I’ve contributed to NancyFX. I once contributed a small patch to PHP to make it consume WCF services better. I’ve contributed to several other OSS projects as well.

Perhaps your saying SQL Server is a Microsoft product, not some hippie Linux thing. Perhaps you share the same sentiment as Noel McKinney:

However, as I pointed out to Noel, the mothership’s (i.e, Microsoft’s Editors Note: Noel has stated to me he meant Microsoft) beliefs are not anti OSS. Microsoft has fully embraced Open Source. You can become an MVP purely for OSS without any speaking or forum contributions. One of the authors of NancyFX is an example of such a recipient. F#, ASP.NET and Entity Framework are all open source. Just this week Microsoft Open Sourced Roslyn. As a matter of fact I’ve even submitted a patch to the nuget gallery website, which is operated by Microsoft and owned by the OuterCurve foundation. The patch was accepted and my code, along with the code of others was pushed to nuget.org. So I’ve already submitted source code for a website owned and operated by an independent organization  setup by Microsoft, they’ve already accepted it, and the world seems a slightly better place as a result.

So I ask the PASS BoD to consider releasing the SQL Saturday Website source code on github, and I ask the members of PASS to ask their BoD to release the source code as well.

## April 07, 2014

### Josef "Jeff" Sipek

#### Happy 50th, System/360

It’s been a while since I blahged about mainframes. Rest assured, I’m still a huge fan, I’m just preoccupied with other things to continuously extoll their virtues.

The reason I’m writing today is because it is the 50th anniversary of the System/360 announcement. Aside from the “50 years already?” sentiment, I have a couple of images to share. (I found these several years ago on someone’s GeoCities site. It’s a good thing I made a mirror :) )

I also came across this video from 1964:

## March 24, 2014

### Josef "Jeff" Sipek

#### Netflix Chaos Monkey

Somehow, I managed to miss that about two years ago Netflix open sourced their chaos monkey.

Based on my quick look over the code, it appears to be written in Java. Meh. Regardless of the language, it’s great to see large companies open source their code.

## March 02, 2014

### Josef "Jeff" Sipek

#### Comment Spam Filtering Experiments

Just a heads up, I’m getting fed up with all the comment spam that ends up on the moderation queue. So, I’m working on some code to reject comment spam before it hits it. As the title for this post implies, these are experiments; I’ll try my best not to reject any valid comments. I appologize if a valid comment does get rejected.

If you end up being a victim of my overzealous filters, please email me: jeffpc@josefsipek.net.

## February 22, 2014

### Josef "Jeff" Sipek

#### Greetings from Nexenta

In case you missed it, back in mid-2011 I discovered Illumos and OpenIndiana. At that point, I already missed hacking on the (Linux) kernel. Based on my blahg posts [1,2], it shouldn’t surprise you that it didn’t take long before I wanted to hack on the Illumos kernel…and so I did.

If you ever contributed to an open source project in your free time while employed full-time, you understand that there’s only so much time you can devote to the open source project and therefore there is only so much you can do.

A couple of months ago, I decided to explore the possibility of working full-time on Illumos. There are only a handful of companies that visibly participate in the Illumos ecosystem, but their use of Illumos is pretty varied (from public clouds to virtualized databases to SAN/NAS appliances). As of this past Tuesday (Monday was a holiday), I’m at Nexenta. At least for now, I’m working remotely (from Ann Arbor) with the fine folks in the  Lowell office. It feels great to work on open source again.

## February 16, 2014

### Nate Berry

#### Arch Linux on bootable, persistent USB drive

I recently got a new laptop from work. Its a refurbished Dell Latitude E6330 with an Intel Core i5 processor, a 13″ screen and a 120GB SSD drive that came with Windows 7 Pro. I haven’t used Windows regularly in quite some time (I’ve been using a WinXP VM on the rare occassion I need […]

### Justin Dearing

#### Creating a minimally viable CentOS OpenLogic rapache instance

Recently I’ve been dealing with R and rapache at work. R is a language for statisticians. rapache is an apache module for executing R scripts in apache. Its like mod_perl or mod_php for R. I’ve been writing simple RESTful scripts that return graphics and JSON, and calling them from static html pages. I’ve been also using my MSDN Azure subscription to engage in R self study at home. In the spirit of my last post, I’ve posted the setup notes here to get you stated with a new Azure VM for running an rapache instance. Azure used a special cloud enabled version fo CentoS 6.3 called OpenLogic. However, it seems to work similarly to the vanilla CentoOS 6.4 instances I’ve used at work. So everything should apply there. If something doesn’t work leave a comment.

• First, CentOS is very conservative, but Fedora makes EPEL to give you a more modern set of RPMs
• rpm -Uvh http://epel.mirror.freedomvoice.com/6/i386/epel-release-6-8.noarch.rpm
• Now lets install the packages we need. The kernel will be updated, so we will need to reboot.
• yum update -y
• yum install -y vim-x11 vim-enhanced xauth R terminator xterm rxvt R httpd git httpd-devel gcc cairo cairo-devel libXt-devel
• yum groupinstall -y fonts
• ldconfig
• shutdown -r now
• Now as a regular user lets compile rapache.
• mkdir ~/src
• cd ~/src
• git checkout https://github.com/jeffreyhorner/rapache.git
• cd rapache
• ./configure && make && sudo make install
• Now lets configure rapache. Create a file called /etc/httpd/conf.d/rapache.conf with the following:
# rapache configuration by Justin Dearing <zippy1981@gmail.com>
<Location /RApacheInfo>
SetHandler r-info
</Location>
RHandler sys.source
• Now restart apache.  Make sure it’t working by running
elinks http://localhost/RApacheInfo.`

Azure doesn’t configure swap space by default. You’re going to absolutely need some swap space if you’re using an extra small instance. A good howto for that is here.

## February 15, 2014

### Justin Lintz

#### Pager Huety

For a hack week project at Chartbeat, I hooked my Philip’s Hue light bulbs into PagerDuty so whenever I get paged my lights will start flashing. Read about the hack over on the PagerDuty blog

#### Lessons learned tuning TCP and Nginx in EC2 at Chartbeat

I wrote a couple blog posts for work diving into what we learned optimizing our performance in EC2.

You can read both parts over at

http://engineering.chartbeat.com/2014/02/12/part-2-lessons-learned-tuning-tcp-and-nginx-in-ec2/