Troubleshooting Ruby hangs on windows with windbg by Matt Wrock

Ahhh windbg...those who are familiar with it know that you can't live with it and you can't live without it. Unless you are a hardened windows c++ dev (I am not), if you have used windbg, it was in a moment of desperation when all of the familiar tools just wouldn't cut it. So now whenever you hear the word "windbg", it conjours up the archetype of hard to use, arcane, engineering tooling. Step aside git and vi, windbg kicks your ass when it comes to commands that were intended to be forgotten. Yet here's the thing, windbg has saved your but. It allowed you to see things nothing else could show you. Did git or vi provide this value?...ok...maybe they did...but still!!

Many of us have personal windbg stories. These usually involve some horrendous production outage and are mixed with the drama of figuring out a tool that was not meant to just be picked up and mastered. I have several of these stories in my high volume web development days. Here are a couple:

  1. A frequently called function uses a new regex causing CPU to spike accross all web nodes.
  2. Web site becomes unresponsive because the asp.net thread pool is saturated with calls to a third party web service that is experiencing latency issues.

I survived these events and others thanks to windbg. Thanks windbg! But I still hate you and I think I always will.

windbg and ruby?

So why am I talking about windbg and ruby in the same post? Are they not from two different worlds that should remain apart? Well I have now had two separate incidents in the past few months where I have needed to use windbg with a problem in ruby code. Both of these incidents are similar in that they both differ from my previous run ins with windbg where I was debugging managed code and they both were related to a similar Chef related problem.

When it comes to ruby, its all unmanaged code and you may not know the difference between managed and unmanaged code in the windows world and that is totally ok, but if you do, I can tell you there are extensions you can use with windbg (like sos.dll) to make your life easier. Sorry, those do not apply here. Its all native from here on out!

I've blogged long ago about debugging native .net code with windbg, bit one, thats a super old post and two, I had to discover new "go to" commands for dealing with unmanaged memory dumps.

Lets talk like we are 2

As in two years old. First, two year olds are adorable. But the real reason is that's pretty much the level where I operate and whatever windbg maturity I gained over the past few days will be lost the next time I need it. I'm writing this post to my future self who has fewer fast twitching brain cells that I have now and is in a bind and needs a clear explanation of how to navigate in a sea of memory addresses.

So to those who are more seasoned, this may not be for you and you may know a bunch more tricks to make this all simpler and for you god created commenting engines.

The problem

So here is the scenario. You notice that your windows nodes under Chef management are not converging. Maybe, they are supposed to converge every few minutes but they have not converged for hours.

So you figure something has happened to cause my nodes to fail and expect to find chef client logs showing chef run after run with the same failures. Nothing new, you will debug the error and fix it. So you open the log and see the latest chef run has just started a new run. Then you look at the time stamp and notice it started hours ago but never moved past finding its runlist. Its just sitting there hung.

No error message to troubleshoot. There is something wrong but the data stops there. What do you do?

Two words...windbg my friend...windbg. Ok its not two words but kind of has a two word ring to it.

What is this windbg?

Windbg is many things really. At its core, its simply a debugger and many use it to attach to live processes and step through execution. That's not usually a good idea when troubleshooting a multithreaded application like a web site but may not be bad for a chef run. However, I have never used it in this manner.

Another very popular usage is to take a snapshot of a process, also called a memory dump, and use it to deeply examine exactly what was going on in the system at that point in time.

The great thing is that this snapshot is very complete. It has access to all memory in the process, all threads and all stack traces. However the rub is that it is very raw data. Its just raw memory, a bunch of addresses pointers and hex values that may cause more confusion than help.

There are several commands to help sort out the sea of data but its likely far less familiar, intuitive or efficient than your day to day dev tools.

This is one reason why I write this post and why I wrote my last windbg post, I can never remember this stuff and the act of committing my immediate memory to writing and having a permanent record of this learning will help me when I inevitably have another similar problem in the future.

Taking a dump

What?!

Oh...god...no.

Thats really what we call this. Seriously. I sit with straight faced, well paid, professional adults and ask them to take dumps and give them to me.

Oh stop it.

Seriously though, this is the first stumbling point in the debugging endeavor. There are several kinds of memory dumps (crash dumps, hang dumps, minidumps, user mode dumps, kernel dumps, etc), each have their merits and there are different ways to obtain them and some means are more obscure than others.

For debugging ruby hangs, we generally just need a user mode dump of the ruby.exe process. This is not going to be a thorough discussion on all the different types of dumps and the means to produce them but I will cover a couple options.

Task Scheduler

In recent versions of windows, they come equipped with a crash dump generation tool that anyone can easily access. Simply right click on the process you want to examine and then select "create dump file", this generates a user mode dump file of the selected process. There are a couple downsides to collecting dumps in this manner:

1. These dumps do not include the handle table and therefore any use of the !handle command in windbg will fail.

2. On a 64 bit machine, unless you explicitly invoke the 32 bit task scheduler, you will get a 64 bit dump even of 32 bit processes. This is not a big deal really and we'll talk about how to switch to 32 bit mode in a 64 bit dump.

ProcDump

There is a sysinternals tool, ProcDump, that can be used to generate a dump. This tool allows you to set all kinds of thresholds in order to capture a dump file at just the right time based on CPU or memory pressure as well as other factors. You can also simply capture a dump immediately. I typically run:

procdump -ma <PID> <path of dump file>

So in the event of analyzing a ruby process dump, I simply give it the process id of ruby.exe. This will include the handle table and is smart enough to produce a 32 bit dump for 32 bit processes.

There are plenty of other ways to get dumps. I used to use AdPlus.vbs years ago and I *think* that will still work and ships with the debugging tools for windows that also includes windbg. However you get your dump file, it will end in a .dmp extension. If supplying a output path to procdump or other generator, make sure you use that extension when naming the file.

When to take the dump

In this post we are discussing hang scenarios. So when analyzing hangs, take the dump when your process is hanging. If we were analyaing a CPU pinning scenario, we would take the dump when the process was pinning the cpu. This can be easier said than done especially in production environments where hangs and pins may not surface on demand and may take hours or days to occur after a process starts. If you find yourself in such a situation where its tricky to generate the dump file at just the right time, have a look at the various procdump switches that can wait for just the right moment based on several different triggers.

Simulating high memory pressure

In the scenario described here, one of the hang events was preceded by high memory pressure on the chef node as seen from various flavors of OutOfMemoryExceptions on preceding chef runs. To reproduce the scenario, I needed to create an environment where there was little memory available on the box. For the benefit of my future self and others, here is how I did that:

  1. Created a VM with relatively low memory. 512MB was adequate in my case.
  2. Shrunk the paging file down to about 128MB
  3. Used the sysinternals tool, TestLimit to artificially leak memory on the box running: TestLimit64.exe -d 1 -c

At this point based on analyzing previous dumps, I knew exactly where in the chef run the problem was occurring so I just created a little ruby program to run the problem call in a loop until the ruby process hanged.

Getting windbg

Windbg is included in Microsoft's debugging toolset which is included in the Windows SDK. From the microsoft download site, you can download the entire SDK or choose only the debugging tools.

Another option is chocolatey. I maintain the windbg chocolatey package and this package will download and install the windows debugging tools and does a few extra helpful things:

  1. Adds a microsoft symbol server path environment variable that windbg will pick up. Assigning the path symsrv*symsrv.dll*f:\localsymbols*http://msdl.microsoft.com/download/symbols to the _NT_SYMBOL_PATH variable. This is important if you want to be able to make much sense of the dump files. Without symbols, you are left to raw memory addresses with no specific information on the structure and calling patterns of key classes and structures in the call stack.
  2. Copies the sos.dll from your local .net installation to the same folder where windbg is installed. This comes in handy for managed code debugging not covered here.
  3. Adds separate shortcut icons for the 64 bit and 32 bit versions of windbg. It can be important depending on the type of process dumped, to use the corresponding version of windbg. It really does not mater so much for this post.

Simply run:

choco install windbg -y

to install the chocolatey windbg package. I do this on all of my dev machines.

Analyzing dump files

So now you have the tools, windbg, you need and a dump file of a problem process to analyze. The remainder of this post will cover how to make sense of the dump with windbg. I will by no means be covering all the commands possible. In fact I'm just gonna cover a handful which should be all you need but I'd encourage anyone to at least browse over this helpful command summary since there very well may be a command that is better suited for your scenario. I'd also point out that I am NOT a windbg guru and if you see a command or strategy I am missing here that you think would be helpful, please provide a comment.

Loading the dump file

Once windbg is open, navigate to File -> Open Crash Dump and browse to the .DMP file you captured. I know...your process may not have actually "crashed", but this is the correct menu option.

So now your dump is loaded into windbg and the entire memory heap of the process is at your fingertips. Here is the typical path I walk at this point under unmanaged scenarios like a ruby hang.

Which thread do I care about?

Chances are nearly certain that there is more than a single thread captured in the dump and in fact depending on the nature of the app, there can be quite a few. However it is likely you only care about one. At least that was the case in my scenario. I want to examine the thread that is hung.

I'll start by running !runaway. Not because thats the thing I want to do most at this time. Rather this will list all threads and how much user time they have consumed:

0:000> !runaway
 User Mode Time
  Thread       Time
   0:734       0 days 0:20:11.843
   1:94        0 days 0:00:00.078

Here there are only two threads. I am more interested in the one that was running for 20 minutes: thread 0. Now the times displayed here may be lower, perhaps dramatically lower than you expect. My process had been running for a couple hours, what's up with 20 minutes!? Its even possible that your process has been hung for days, and the time displayed here will be a mere milliseconds.

Its important to note that this is "User Time" and is the amount of time that the thread was actually working. To get actual "elapsed time" you can add a parameter to !runaway. I'll run !runaway 7 to get user, kernel and elapsed time:

0:000> !runaway 7
 User Mode Time
  Thread       Time
   0:734       0 days 0:20:11.843
   1:94        0 days 0:00:00.078
 Kernel Mode Time
  Thread       Time
   0:734       0 days 0:13:19.437
   1:94        0 days 0:00:00.031
 Elapsed Time
  Thread       Time
   0:734       0 days 2:58:53.443
   1:94        0 days 2:58:53.412

This is from the dump where I artificially created the high memory pressure environment and the same ruby thread was looping for hours. Of the 3 hours it was looping, the thread itself was only doing work for 20 minutes. Some of the production dumps I looked at had a much higher elapsed time (days) and lower user time (just milliseconds) because Chef had started a ruby process that immediately invoked a blocking call to an external process and then waited for days, during which time it did no work. Lazy bum.

Call stacks

Now lets look at some call stacks. Even though I'm pretty sure at this point that thread 0 is the one I'm interested in, I'll fist browse the stack of all threads using: ~*k

0:000> ~*k

.  0  Id: 90c.734 Suspend: 0 Teb: 7ffdd000 Unfrozen
ChildEBP RetAddr  
0028ea2c 7746112f ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c combase!CCliModalLoop::BlockFn+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 combase!ThreadSendReceive+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d combase!CRpcChannelBuffer::SendReceive2+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d combase!ClassicSTAThreadSendReceive+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 combase!CCtxComChnl::SendReceive+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 combase!NdrExtpProxySendReceive+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f combase!ObjectStublessClient+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 combase!ObjectStubless+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e wbemdisp!CDispatchHelp::Invoke+0xaf
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for win32ole.so - 
0028f7b8 6ab901e2 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde win32ole+0x101e2
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for msvcrt-ruby200.dll - 
0028f988 668e3539 win32ole+0x10bde
0028fa18 668ef3f8 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d msvcrt_ruby200!rb_check_copyable+0x37d1
*** ERROR: Module load completed but symbols could not be loaded for ruby.exe
0028fe88 0040287f msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa ruby+0x287f
0028ff80 77087c04 ruby+0x13fa
0028ff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

   1  Id: 90c.94 Suspend: 0 Teb: 7ffda000 Unfrozen
ChildEBP RetAddr  
022cfec8 77452cc7 ntdll!NtWaitForSingleObject+0xc
022cff3c 77452c02 KERNELBASE!WaitForSingleObjectEx+0x99
022cff50 668fe676 KERNELBASE!WaitForSingleObject+0x12
WARNING: Stack unwind information not available. Following frames may be wrong.
022cff80 77087c04 msvcrt_ruby200!rb_thread_list+0x10d6
022cff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
022cffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
022cffec 00000000 ntdll!_RtlUserThreadStart+0x1b

OK, so the ~*k semantics were probably obvious but remember we are 2. Lets walk through each character:

~ - Typing this alone would have simply listed 2 lines: each thread and their status.

* - This returned all threads. We could have supplied a number instead to get a single thread like ~0k to get just thread 0.

k - this produced the call stack.

There are more single character commands we can use after the 'k' each equally intuitive and discoverable. We will look at a couple in just a bit.

So here we have call stacks for each thread with a line for each frame in the stack. One can kind of groc the gist of the execution here: a bunch of ruby C calls and then an RPC call is sent and it waits for that call to return. The two addresses at the front of each frame are likely not much help for us. They refer to the base pointer of the frame and the return address where execution is to resume.

So now you are dissapointed. You were probably thinking you'd be seeing an actual ruby stack trace. Yeah. That would be cool. So one challenge we face is finding what actual ruby code, perhaps that we wrote, is at play here. However, if you find yourself needing to debug at this level, there are things happening outside of ruby that are of real interest.

Another thing to note is that the ruby frames seem brief compared to the COM frames above. Later we'll see how examining the COM is much easier than ruby. This is because we lack symbols for the compiled ruby runtime. I have not tried it myself, but we could compile ruby ourselves and prooduce symbols then capture a dump running on that ruby. This would likely be helpful but not practical in a production environment.

Switching to the correct bitness

Maybe you are seeing nothing like this at all. If you are seeing something like this:

  0  Id: 90c.734 Suspend: 0 Teb: 00000000`7ffdb000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`0008ee08 00000000`77581e66 wow64cpu!CpupSyscallStub+0x2
00000000`0008ee10 00000000`7754219a wow64cpu!WaitForMultipleObjects32+0x1d
00000000`0008eec0 00000000`775420d2 wow64!RunCpuSimulation+0xa
00000000`0008ef10 00007ffc`7f383a15 wow64!Wow64LdrpInitialize+0x172
00000000`0008f450 00007ffc`7f362f1e ntdll!LdrpInitializeProcess+0x1591
00000000`0008f770 00007ffc`7f2d8ebe ntdll!_LdrpInitialize+0x8a00e
00000000`0008f7e0 00000000`00000000 ntdll!LdrInitializeThunk+0xe

This is what a 32 bit thread looks like in a 64 bit dump. Thats ok. You can switch inside of this same dump to 32 bit mode by running:

!wow64exts.sw

Now run ~0k and you will see a stack like the larger one before.

Going deeper into the stack

So I'll be honest, the ruby frames here make no sense to me. This may be due to the lack of symbols but the last ruby frame would indicate that an argument error is being thrown. However it is called by rb_f_send which does not call rb_error_arity according to the ruby source and rb_error_arity does not call into win32ole. So I don't know whats going on here. Now I do know that the ruby win32ole class manages WMI queries and I also know that wbem is the prefix of the wmi scripting library in windows. Further, WMI queries are handled out of process which would explain the RPC call. In this case, the hang is occuring in a Chef client run but before the cookbook recipes actually run. This is when chef "ohai" data is collected which is the data that describes the current state of a machine and windows nodes leverage WMI for much of that data.

So is there any more info that might lend us some clues here? Well one step we can take is to query the thread 0 stack again and this time try to get the actual parameters being passed into the function calls. Thats done by adding...wait...I wont tell...you guess...

...thats right! Of course...'b'. We add b to the ~0k command so we run ~0kb:

0:000> ~0kb
ChildEBP RetAddr  Args to Child              
0028ea2c 7746112f 00000002 0028ebe0 00000001 ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 00000002 0028ebe0 00000000 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 00000001 0028ec80 ffffffff user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c 0028ec80 00000001 0028ec7c combase!CCliModalLoop::BlockFn+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- -------- -------- -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- -------- -------- -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 0028ef84 005c9c54 00505170 combase!ThreadSendReceive+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- -------- -------- -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d 005c9c54 0028ef84 0028ef4c combase!CRpcChannelBuffer::SendReceive2+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- -------- -------- -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- -------- -------- -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d 005c9c54 0028ef84 0028ef4c combase!ClassicSTAThreadSendReceive+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- -------- -------- -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 005c9c54 0028ef84 0028ef4c combase!CCtxComChnl::SendReceive+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 02b0f2c4 0028efb0 76dc5740 combase!NdrExtpProxySendReceive+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b 29b71661 76f113e0 0028f3d0 rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e 74ba39b0 74ba4ce0 0028f3d0 rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f 0028f3d0 00000003 0028f440 combase!ObjectStublessClient+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 02b0f2c4 005bc634 ffffffff combase!ObjectStubless+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f 005bc690 ffffffff 00000001 fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 02b7b668 0028f4f8 0051197c wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 02b7b668 00000024 00000004 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f 004f54ec 02b7b668 00000001 oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e 02b7b678 00000001 6ab991f4 wbemdisp!CDispatchHelp::Invoke+0xaf
0028f7b8 6ab901e2 02b7b668 00000001 6ab991f4 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde 00000003 00000000 01dc6590 win32ole+0x101e2
0028f988 668e3539 00000001 00433098 02386500 win32ole+0x10bde
0028fa18 668ef3f8 0262556e 02487ce8 00000015 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 0043308c 00433088 00000004 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd 00375750 004b2f30 02419480 msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 01d5af70 00000004 02487ed5 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 00000001 00000000 01dda1b8 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 02522f58 00000000 ffffffff msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d 0028fe7c 00000000 0028fe88 msvcrt_ruby200!rb_check_copyable+0x37d1
0028fe88 0040287f 02522f58 00372cb8 0028ff80 msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa 00000002 00372cb8 00371550 ruby+0x287f
0028ff80 77087c04 7ffde000 77087be0 28433cec ruby+0x13fa
0028ff94 7765ad1f 7ffde000 28241b9c 00000000 kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ffffffff 7764024a 00000000 ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 004014e0 7ffde000 00000000 ntdll!_RtlUserThreadStart+0x1b

This adds three more memory addresses to each frame and they point to the first three arguments passed to the function on the stack.

So these may be hex representations of actual values of pointers to other data structures. If we have symbols for the code being called, things get alot easier but lets look at a ruby call first. I'd like to check out the args passed to the last ruby call in the stack.

Examining raw address values

So the command to display memory in windbg is 'd'. I know, thats really kind of a given but remember we are 2. Also, its not just d. d is always accompanied by another character to let windbg know how to display that value. Here are some I find myself using alot:

  • dd - DWORD (4bytes) by default it spits out 32 of them but you can limit the number with the L parameter (ex. dd 7746112f L2)
  • dc - Same output as dd but adds ascii representation to the right
  • da - displays the address memory as a ascii string
  • du - displays the memory as a unicode string

There is another one, dt, that is super helpful and we will look at that in just a bit.

So one thing I am curious about here are the arguments passed to the last ruby frame:

0028fa18 668ef3f8 0262556e 02487ce8 00000015 msvcrt_ruby200!rb_error_arity+0x129

Remember, its the 2nd - 4th addresses that contain the first three arguments. You always see 3 addresses regardless of the number of arguments the method takes so these may point to nowhere useful. Looking at the first argument dc 0262556e produces:

0:000> dc 0262556e 
0262556e  336e6977 6e705f32 67697370 6464656e  win32_pnpsignedd
0262557e  65766972 26340072 5f564552 335c3130  river.4&REV_01\3
0262558e  37363226 36313641 26332641 00003833  &267A616A&3&38..
0262559e  3d740241 1b8612c8 43508000 45565c49  A.t=......PCI\VE
026255ae  30385f4e 44263638 375f5645 26323931  N_8086&DEV_7192&
026255be  53425553 305f5359 30303030 26303030  SUBSYS_00000000&
026255ce  5f564552 335c3330 37363226 36313641  REV_03\3&267A616
026255de  26332641 00003030 3d7d0000 1c011331  A&3&00....}=1...

OK, this is interesting and curious. win32_pnpsigneddriver is a valid WMI class. However, this would not be a valid argument for rb_error_arity which takes 3 ints. So now I'm just dismissing the entire ruby stack to lack of debug symbols, but the I feel confident a WMI call is a factor. Here is another clue: the last argument on the second win32ole frame:

0:000> dc 02386500 
02386500  0000000c 02361db0 00000000 6ab834f0  ......6......4.j
02386510  0241c9d8 00000000 09402005 01d5a820  ..A...... @. ...
02386520  00000023 02571888 00000023 00000000  #.....W.#.......
02386530  0000a007 01d528a0 02386560 00000000  .....(..`e8.....
02386540  00000000 00000000 09424005 01d5a820  .........@B. ...
02386550  63657845 72657551 00000079 00000000  ExecQuery.......
02386560  00502005 01d5a820 00000023 02625560  . P. ...#...`Ub.
02386570  0000003e 00000000 08202005 01d5a820  >........  . ...

ExecQuery is the WMI method name that performs queries on WMI classes.

So searching the chef ohai gem I come accross:

drivers = wmi.instances_of('Win32_PnPSignedDriver')

This instances_of method uses the wmi-lite gem calling ExecQuery on the locator returned from:

locator = WIN32OLE.new("WbemScripting.SWbemLocator")

So now I have a pretty good idea what application code is triggering the hang. As I stated earlier in this post, WMI queries create an RPC call to a process named WmiPrvSE.exe. Its not uncommon to see more than one of these processes on a machine. So now I'm wondering if it might be possible to track which of these processes this thread is waiting on. This gets us to a couple other useful windbg commands.

Navigating types

So we do have symbols for the COM code handling the RPC message. That being the case, there is another stack trace command we can use to get us the parameters in a more friendly way - ~0kP

0:000> ~0kP
ChildEBP RetAddr  
0028ea2c 7746112f ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c combase!CCliModalLoop::BlockFn(
            void ** ahEvent = 0x0028ec80, 
            unsigned long cEvents = 1, 
            unsigned long * lpdwSignaled = 0x0028ec7c)+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 combase!ThreadSendReceive(
            class CMessageCall * pCall = 0x00505170, 
            struct _GUID * rmoid = 0x00509fb4 {70622196-bea4-9302-ac70-cabc069d1e23})+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d combase!CRpcChannelBuffer::SendReceive2(
            struct tagRPCOLEMESSAGE * pMessage = 0x0028ef84, 
            unsigned long * pstatus = 0x0028ef4c)+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d combase!ClassicSTAThreadSendReceive(
            class CAptRpcChnl * pChannel = 0x005c9c54, 
            struct tagRPCOLEMESSAGE * pMsg = 0x0028ef84, 
            unsigned long * pulStatus = 0x0028ef4c)+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 combase!CCtxComChnl::SendReceive(
            struct tagRPCOLEMESSAGE * pMessage = 0x0028ef84, 
            unsigned long * pulStatus = 0x0028ef4c)+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 combase!NdrExtpProxySendReceive(
            void * pThis = 0x02b0f2c4, 
            struct _MIDL_STUB_MESSAGE * pStubMsg = 0x0028efb0)+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f combase!ObjectStublessClient(
            void * ParamAddress = 0x0028f3d0, 
            long Method = 0n3)+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 combase!ObjectStubless(void)+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e wbemdisp!CDispatchHelp::Invoke+0xaf
0028f7b8 6ab901e2 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde win32ole+0x101e2
0028f988 668e3539 win32ole+0x10bde
0028fa18 668ef3f8 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d msvcrt_ruby200!rb_check_copyable+0x37d1
0028fe88 0040287f msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa ruby+0x287f
0028ff80 77087c04 ruby+0x13fa
0028ff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

Maybe still not the most beautiful output, but look at those combase calls. We have types and parameter names! Amazing! So to get the process ID of the service handling the call, we want to look at the CMessageCall passed to ThreadSendReceive. We can dig into CMessageCall using the dt command. It takes the symbol to inspect and its address:

0:000> dt CMessageCall 0x00505170
Ambiguous matches found for CMessageCall (dumping largest sized):
    combase!CMessageCall {0x004 bytes}
    combase!CMessageCall {0x100 bytes}
combase!CMessageCall
   +0x000 __VFN_table : 0x76efb4c0 
   =7703e09c s_uCurrentCallTraceId : 0x23ac7b
   +0x004 __VFN_table : 0x76efb53c 
   +0x008 __VFN_table : 0x76efb52c 
   +0x00c _pNextMessage    : (null) 
   =76ef0000 Kind             : 0x905a4d (No matching name)
   +0x010 _callcat         : 1 ( CALLCAT_SYNCHRONOUS )
   +0x014 _iFlags          : 0x202
   +0x018 _hResult         : 0n0
   +0x01c _hEvent          : 0x00000260 Void
   +0x020 _hWndCaller      : (null) 
   +0x024 _ipid            : _GUID {0009042d-030c-0000-5ed7-a7ed59ca66b0}
   +0x034 _hSxsActCtx      : 0xffffffff Void
   +0x038 _server_fault    : 0
   +0x03c _destObj         : CDestObject
   +0x048 _pHeader         : 0x005bfdb8 Void
   +0x04c _pLocalThis      : 0x005bfdd8 WireLocalThis
   +0x050 _pLocalThat      : (null) 
   +0x054 _pHandle         : 0x005941a0 CChannelHandle
   +0x058 _hRpc            : 0x02b68f18 Void
   +0x05c _pContext        : (null) 
   +0x060 _dwCOMHeaderSize : 0x30
   +0x064 message          : tagRPCOLEMESSAGE
   +0x090 _iidWire         : _GUID {423ec01e-2e35-11d2-b604-00104b703efd}
   +0x0a0 _guidCausality   : _GUID {1a786d47-0267-499f-b9e9-6678d3e29858}
   +0x0b0 _dwServerPid     : 0x30c
   +0x0b4 _dwServerTid     : 0
   +0x0b8 _iMethodWire     : 3
   +0x0bc _dwErrorBufSize  : 0
   +0x0c0 _pASTAIncomingCallData : (null) 
   +0x0c4 m_ulCancelTimeout : 0xffffffff
   +0x0c8 m_dwStartCount   : 0xfb0049
   +0x0d0 m_pClientCtxCall : (null) 
   +0x0d4 m_pServerCtxCall : (null) 
   +0x0d8 _bCheckedClientIsAppContainer : 0y0
   +0x0d8 _bClientIsAppContainer : 0y0
   +0x0d8 _bServerIsAppContainer : 0y0
   +0x0d8 _bCheckedClientIsRpcss : 0y0
   +0x0d8 _bClientIsRpcss  : 0y0
   +0x0d8 _bCheckedClientIsStrongNamedProcess : 0y0
   +0x0d8 _bClientIsStrongNamedProcess : 0y0
   +0x0d8 _bClientCalledGetBuffer : 0y1
   +0x0d9 _bClientCalledSendOrSendReceive : 0y1
   +0x0d9 _bClientCalledReceiveOrSendReceive : 0y1
   +0x0d9 _bClientReceiveOrSendReceiveCompletedSuccessfully : 0y0
   +0x0d9 _bSuppressClientStopTracing : 0y0
   +0x0dc _uCallTraceId    : 0x23ac7b
   +0x0e0 _hrClient        : 0
   +0x0e4 _sourceOfHResult : 0 ( Unknown )
   +0x0e8 _pInBiasContainer : (null) 
   +0x0ec _pOutBiasContainer : (null) 
   +0x0f0 _pushTlsMitigation : PushTlsPreventRundownMitigation
   +0x0f8 _bEnabledInprocOutParameterMitigation : 0y0
   +0x0f8 _bCheckedConditionsForOutofprocOutParameterMitigation : 0y0
   +0x0f8 _bShouldEnableOutofprocOutParameterMitigation : 0y0
   +0x0f8 _bScheduledAcknowlegmentOfOutParameterMarshalingSet : 0y0
   +0x0fc _pServerOXIDEntry : 0x004fbe68 OXIDEntry

So the key property here is _dwServerPid with a value of 0x30c, which translates to a decimal value of 780. This was a svchost.exe process which can host several windows services. This process was hosting the Windows Management Instrumentation service.

To be honest at this point my energy to dive into that process to see if I could find which spawned WmiPrvSE.exe was at fault had fizzled. There had been alot of other troubleshooting beyond what we have looked at here leading me to believe that there was likely nothing helpful on the WmiPrvSE side that would be useful. The actionable bit of data I could walk away from here is to include a timeout in the wmi-lite query so that if WMI queries hang in the future, the process will at least fail allowing subsequent chef runs to proceed.

Before ending this post, I would like to cover a couple other points that may be helpful in other investigations.

Waiting handles

One dead giveaway from this stack that things are stuck is the WaitForMultipleObjects call at the very top.

0028ea2c 7746112f 00000002 0028ebe0 00000001 ntdll!NtWaitForMultipleObjects+0xc

Just google this name and you will find reams of posts, some well worth the read, talking about hangs. We can even examine the waiting handle. Its the second parameter sent to that call. First we need the handle address that argument points to. We will use dd 0028ebe0 L1 to get the very first dword:

0:000> dd 0028ebe0 L1
0028ebe0  00000260

This gives us the handle address which we retrive using the !handle command:

0:000> !handle 00000260  f
Handle 00000260
  Type          Event
  Attributes    0
  GrantedAccess 0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount   2
  PointerCount  35085
  Name          <none>
  Object specific information
    Event Type Auto Reset
    Event is Waiting

You definitely want to use the f parameter other wise you just get the handle type. So we see here the thread is waiting.

Locks, lock owners and orphans

In another hang I had some dumps of the WmiPrvSE processes on the machine at the time that ruby was hanging. In this dump was the following thread:

   3  Id: 1dd0.293c Suspend: 0 Teb: 7f28f000 Unfrozen
ChildEBP RetAddr  Args to Child              
0385dfa4 7724df87 00000380 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0385e018 7724de0b 00000001 0366e230 0366e260 ntdll!RtlpWaitOnCriticalSection+0xd0
0385e040 7724de35 0385e064 73f33450 0366e260 ntdll!RtlpEnterCriticalSectionContended+0xa0
0385e048 73f33450 0366e260 02e1bb48 0366e230 ntdll!RtlEnterCriticalSection+0x42
0385e064 73f33721 3a369589 00000000 00000000 signdrv!CWhqlObj::WalkTree+0x18
0385e930 73f3364d 02df2da8 02e1bb48 00000000 signdrv!CWhqlObj::CreateList+0x4b
0385e97c 00106d8a 0366e238 02e1b394 00000000 signdrv!CWhqlObj::CreateInstanceEnumAsync+0xdd
0385e9d8 00106a26 00000000 02dc50b4 00000000 WmiPrvSE!CInterceptor_IWbemSyncProvider::Helper_CreateInstanceEnumAsync+0x2c1
0385ea34 75c8676f 02dd8b70 02dc50b4 00000000 WmiPrvSE!CInterceptor_IWbemSyncProvider::CreateInstanceEnumAsync+0xf6
0385ea5c 75d071fc 00106930 0385eca8 00000005 rpcrt4!Invoke+0x34
0385f0e0 74fa868d 02dbd098 02dc4154 02db95b4 rpcrt4!NdrStubCall2+0x32c
0385f12c 75c6147c 02dbd098 02db95b4 02dc4154 combase!CStdStubBuffer_Invoke+0x99 [d:\9147\com\combase\ndr\ndrole\stub.cxx @ 1590]
0385f148 74072f08 02dbd098 02db95b4 02dc4154 rpcrt4!CStdStubBuffer_Invoke+0x2c
0385f164 74fa853f 02dbda6c 02db95b4 02dc4154 fastprox!CBaseStublet::Invoke+0x38
0385f1f0 74e8c086 02dbb228 000f3888 02dc4154 combase!SyncStubInvoke+0x14c [d:\9147\com\combase\dcomrem\channelb.cxx @ 1664]
(Inline) -------- -------- -------- -------- combase!StubInvoke+0x9e [d:\9147\com\combase\dcomrem\channelb.cxx @ 1957]
0385f31c 74fa9600 02dc4154 02db95b4 02dbda6c combase!CCtxComChnl::ContextInvoke+0x236 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 1377]
(Inline) -------- -------- -------- -------- combase!DefaultInvokeInApartment+0xffffffe8 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 2716]
0385f3c4 74fa8de0 02dbda6c 02dd8b70 02de2de0 combase!AppInvoke+0x415 [d:\9147\com\combase\dcomrem\channelb.cxx @ 1481]
0385f520 74fa964f 74fa9710 00000000 02dc0ed0 combase!ComInvokeWithLockAndIPID+0x605 [d:\9147\com\combase\dcomrem\channelb.cxx @ 2311]
0385f57c 75c867cd 02dc0ed0 3c14e0c5 00000000 combase!ThreadInvoke+0x694 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5488]
0385f5c0 75c8722e 74fa9710 02dc0ed0 0385f670 rpcrt4!DispatchToStubInCNoAvrf+0x4d
0385f630 75c86ff7 02dc0ed0 00000000 00000000 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x13e
0385f6c4 75c86d3c 0385f6f4 00000000 02dc0e18 rpcrt4!LRPC_SCALL::DispatchRequest+0x226
0385f720 75c86a14 02de2d48 02de3e50 00000000 rpcrt4!LRPC_SCALL::HandleRequest+0x31c
0385f75c 75c8538f 02de2d48 02de3e50 00000000 rpcrt4!LRPC_SASSOCIATION::HandleRequest+0x1fc
0385f824 75c85101 00000000 75c85070 02d9a0c0 rpcrt4!LRPC_ADDRESS::ProcessIO+0x481
0385f860 77250b10 0385f8e4 02db39c4 02d9a0c0 rpcrt4!LrpcIoComplete+0x8d
0385f898 77250236 0385f8e4 02d9a0c0 00000000 ntdll!TppAlpcpExecuteCallback+0x180
0385fa38 755d7c04 02d981d8 755d7be0 3cb70784 ntdll!TppWorkerThread+0x33c
0385fa4c 7728ad1f 02d981d8 3ee82643 00000000 kernel32!BaseThreadInitThunk+0x24
0385fa94 7728acea ffffffff 7727022d 00000000 ntdll!__RtlUserThreadStart+0x2f
0385faa4 00000000 77244a00 02d981d8 00000000 ntdll!_RtlUserThreadStart+0x1b

This thread jumped out at me because it was waiting to obtain a lock to run a critical section. We can get more info on the critical section by using the !critsec command with the address of the first parameter passed to RtlEnterCriticalSection:

0:000> !critsec 0366e260

CritSec +366e260 at 0366e260
WaiterWoken        No
LockCount          1
RecursionCount     1
OwningThread       2f84
EntryCount         0
ContentionCount    1
*** Locked

We can also use the !locks command:

0:000> !locks

CritSec +366e260 at 0366e260
WaiterWoken        No
LockCount          1
RecursionCount     1
OwningThread       2f84
EntryCount         0
ContentionCount    1
*** Locked

Scanned 13 critical sections

So the kind of obvious next question one might ask themselves now is: what is thread 2f84 doing? Probably nothing productive. However, a listing of threads:

0:000> ~
.  0  Id: 1dd0.9b4 Suspend: 0 Teb: 7f3be000 Unfrozen
   1  Id: 1dd0.2c64 Suspend: 0 Teb: 7f3bb000 Unfrozen
   2  Id: 1dd0.3148 Suspend: 0 Teb: 7f289000 Unfrozen
   3  Id: 1dd0.293c Suspend: 0 Teb: 7f28f000 Unfrozen

shows that thread has gone missing. This probably means one of two things:

  1. We are in the middle of a context switch and owner 2f84 has finished.
  2. 2f84 encountered an access violation and died leaving our thread here orphaned. Tragic but very possible.

Coincidentally, more inspection of this thread shows that it was doing something with our WMI class friend Win32_PNPSignedDriver. This very well could have been why the ruby thread in this case was hanging. It was hard to say because other hangs did not reveal this same pattern.

So I know I've looked at just about enough memory dumps to last me a good while and I hope that this has been equally satisfying to other readers.

Oh...and...hi future self! I'm thinking about you and hoping you a quick investigation.

I'm so excited. I'm about to join Chef and I think I like it! by Matt Wrock

A little over a year ago, a sparkly devops princess named Jennifer Davis came to my workplace at CenturyLink Cloud and taught me a thing or two about this thing she called Chef.  I had been heads down for the previous 3 years trying to automate windows. While wandering in the wilderness of windows automation I had previously bumped into Chocolatey where I met my good friend Rob Reynolds who had told me stories about these tools called Puppet, Chef, and Vagrant. Those were the toys that the good little girls and boys on the other side of the mountain (ie linux land) used for their automation. Those tools sounded so awesome but occupied with my day job at Microsoft that was not centered around  automation and my nights and weekends developing Boxstarter, I had yet to dabble in these tools myself.

So when the opportunity came to join a cloud computing provider CenturyLink Cloud and focus on data center automation for a mixed windows and linux environment, I jumped at the opportunity. Thats where Jennifer entered the scene and opened a portal door leading from my desert to a land of automation awe and wonder and oh the things I saw. Entire communities built, thriving and fighting all about transforming server configuration from lines of code. To these people, "next" and "finish" were commands and not buttons. These were my people and with them I wanted to hang.

There were windows folk I spied through the portal. Jeffrey Snover had been there for quite some time in full robe and sandals pulling windows folk kicking and screaming through to the other side of the portal. Steven Murauwski was there too translating Jeffrey's teachings and pestering Jeffrey's disciples to write tests because that's how its done. Rob had been sucked through the portal door at least a year prior and was working for Puppet. Hey look he's carrying a Mac just like all the others. Is that some sort of talisman? Is it the shared public key needed to unveil their secrets? I don't know...but its so shiny.

So what the heck, I thought. Tonight we'll put all other things aside and I followed Jennifer through the portal and never looked back. There was much ruby to be learned and written. Its not that different from powershell and had many similarities to C# but it was indeed its own man and would occasionally hit back when I cursed it. One thing I found was that even more fun that using Chef to automate CenturyLink data centers, was hacking on and around the automation tooling like Chef, Vagrant, Ruby's WinRM implementation, Test-Kitchen, etc.

There was so much work to be done here to build a better bridge between windows and the automation these tools had to offer. This was so exciting and Chef seemed to be jumping all in to this effort as if proclaiming I know, I know, I know, I know, I want you, I want you. Before long Steven took up residence with Chef and Snover seemed to be coming over for regular visits where he was working with Chef to feed multitudes on just a couple loaves of DSC. I helped a little here and there along with a bunch of other smart and friendly people.

Well meanwhile on the CenturyLink farm, we made alot of progress and added quite a bit of automation to building out our data centers. Recently things have been winding down for me on that effort and it seemed a good time to asses my progress and examine new possibilities. Then it occured to me, if Jennifer can be a sparkly devops princess at chef, could I be one too?

We may not yet know the answer to that question yet but I intend to explore it and enter into the center sphere of its celestial force.

So I'll see my #cheffriends Monday. I'm so excited and I just cant hide it!

A dirty check for Chef cookbooks: testing everything that has changed and only when it is changed by Matt Wrock

I was recently pinged by a #cheffriend of mine, Alex Vinyar. A few months back, I sent him a bit of code that I wrote at CenturyLink Cloud that runs on our build agents to determine which cookbooks to test after a cookbook commit is pushed to source control. He just mentioned to me that he has gotten some positive feedback on that code so I thought I'd blog on the details of that for others who may be interested. This can also be considered a follow up to my post from almost a year ago on cookbook dependency management where I shared some theory I was pondering at the time on this subject. This post covers how I followed up on that theory with working code.

I'm going to cover two key strategies:

  1. Testing all cookbooks that depend on a changed cookbook
  2. Ensuring that once a cookbook is tested and succeeds, it is never tested again as long as it and its dependencies do not change

I'll share the same code I shared with Alex (and more) in a bit.

Not enough to test just the cookbook changed

Lets consider this scenario: You manage infrastructure for a complex distributed application. You maintain about a baker's dozen cookbooks - see what I did there?...baker...chef...look at me go girl! You have just committed and pushed a change to one of your lower level library cookbooks that exposes functions used by several of your other cookbooks. We'd expect your pipeline to run tests on that library cookbook but is that enough? What if your upstream cookbooks that depend on the library cookbook "blow up" when they converge with the new changes. This is absolutely possible and not just for library cookbooks.

I wanted a way for a cookbook commit to trigger not only its own chefspecs and test-kitchen tests but also the tests of other cookbooks that depend on it.

Kitchen tests provide slow but valuable feedback, don't run more than needed

Here is another scenario: You are now testing the entire dependency tree touched by a change...yay! But failure happens and your build fails. However the failure may not have occurred at the root of your tree (the cookbook that actually changed). Here are some possibilities:

  1. Maybe there was some transient network glitch that caused a upstream cookbook to fail
  2. Gem dependency changes outside of your own managed cookbook collection (you know - the dark of semver - minor build bumps that are really "major")
  3. Your build has batched more than one root cookbook change. One breaks but the others pass

So an hour's worth (or more) of testing 10 cookbooks results in a couple broken kitchen tests. You make some fixes, push again and have to wait for the entire set to build/test all over again.

I have done this before. Just look at my domain name!

However many of these tests may be unnecessary now. Some of the tests that has succeeded in the filed batch may be unaffected by the fix.

Finding dependencies

So here is the class we use at CenturyLink Cloud to determine when one cookbook as been affected by a recent change. This is what I sent to Alex. I'll walk you through some of it later.

require 'berkshelf'
require 'chef'
require 'digest/sha1'
require 'json'

module Promote
  class Cookbook
    def initialize(cookbook_name, config)
      @name = cookbook_name
      @config = config
      @metadata_path = File.join(path, 'metadata.rb')
      @berks_path = File.join(path, "Berksfile")
    end

    # List of all dependent cookbooks including transitive dependencies
    #
    # @return [Hash] a hash keyed on cookbook name containing versions
    def dependencies
      @dependencies ||= get_synced_dependencies_from_lockfile
    end

    # List of cookbooks dependencies explicitly specified in metadata.rb
    #
    # @return [Hash] a hash keyed on cookbook name containing versions
    def metadata_dependencies
      metadata.dependencies
    end

    def version(version=nil)
      metadata.version
    end

    def version=(version)
      version_line = raw_metadata[/^\s*version\s.*$/]
      current_version = version_line[/('|").*("|')/].gsub(/('|")/,"")
      if current_version != version.to_s
        new_version_line = version_line.gsub(current_version, version.to_s)
        new_content = raw_metadata.gsub(version_line, new_version_line)
        save_metadata(new_content)
      end
      version
    end

    # Stamp the metadata with a SHA1 of the last git commit hash in a comment
    #
    # @param [String] SHA1 hash of commit to stamp
    def stamp_commit(commit)
      commit_line = "#sha1 '#{commit}'"
      new_content = raw_metadata.gsub(/#sha1.*$/, commit_line)
      if new_content[/#sha1.*$/].nil?
        new_content += "\n#{commit_line}"
      end
      save_metadata(new_content)
    end

    def raw_metadata
      @raw_metadata ||= File.read(@metadata_path)
    end

    def metadata
      @metadata ||=get_metadata_from_file
    end

    def path
      File.join(@config.cookbook_directory, @name)
    end

    # Performs a berks install if never run, otherwise a berks update
    def sync_berksfile(update=false)
      return if berksfile.nil?

      update = false unless File.exist?("#{@berks_path}.lock")
      update ? berksfile_update : berksfile.install
    end

    # Determine if our dependency tree has changed after a berks update
    #
    # @return [TrueClass,FalseClass]
    def dependencies_changed_after_update?
      old_deps = dependencies
      sync_berksfile(true)
      new_deps = get_dependencies_from_lockfile

      # no need for further inspection is 
      # number of dependencies have changed
      return true unless old_deps.count == new_deps.count

      old_deps.each do |k,v|
        if k != @name
          return true unless new_deps[k] == v
        end
      end

      return false
    end

    # Hash of all dependencies. Guaranteed to change if 
    # dependencies of their versions change
    #
    # @param [environment_cookbook_name] Name of environmnt cookbook
    # @return [String] The generated hash
    def dependency_hash(environment_cookbook_name)
      # sync up with the latest on the server
      sync_latest_app_cookbooks(environment_cookbook_name)
      
      hash_src = ''
      dependencies.each do | k,v |
        hash_src << "#{k}::#{v}::"
      end
      Digest::SHA1.hexdigest(hash_src)
    end

    # sync our environmnt cookbook dependencies 
    # with the latest on the server
    #
    # @return [Hash] the list of cookbooks updated
    def sync_latest_app_cookbooks(environment_cookbook_name)
      result = {}

      #Get a list of all the latest cookbook versions on the server
      latest_server_cookbooks = chef_server_cookbooks(1)
      
      env_cookbook = Cookbook.new(environment_cookbook_name, @config)
      
      # iterate each cookbook in our environment cookbook
      env_cookbook.metadata_dependencies.each do |key|
        cb_key = key[0] # cookbook name

        # skip if its this cookbook or if its not on the server
        next if cb_key == @name || !latest_server_cookbooks.has_key?(cb_key)
        latest_version = latest_server_cookbooks[cb_key]['versions'][0]['version']
        
        # if the server as a later version than this cookbook
        # we do a berks update to true up with the server
        if dependencies.keys.include?(cb_key) &&
          latest_version > dependencies[cb_key].to_s
          
          berksfile_update(cb_key)
          result[cb_key] = latest_version
        end
      end

      # return the updated cookbooks
      result
    end

    private

    def chef_server_cookbooks(versions = 'all')
      Chef::Config[:ssl_verify_mode] = :verify_none
      rest = Chef::REST.new(
        @config.chef_server_url, 
        @config.node_name, 
        @config.client_key
      )
      rest.get_rest("/cookbooks?num_versions=#{versions}")
    end

    def berksfile
      @berksfile ||= get_berksfile
    end

    def get_berksfile
      return unless File.exist?(@berks_path)
      Berkshelf.set_format :null
      Berkshelf::Berksfile.from_file(@berks_path)
    end

    def save_metadata(content)
      File.open(@metadata_path, 'w') do |out|
        out << content
      end

      # force subsequent usages of metadata to refresh from file
      @metadata=nil
      @raw_metadata = nil
    end

    def get_metadata_from_file
      metadata = Chef::Cookbook::Metadata.new
      metadata.from_file(@metadata_path)
      metadata
    end

    def get_synced_dependencies_from_lockfile
      # install to make sure we have a lock file
      berksfile.install

      get_dependencies_from_lockfile
    end

    # get the version of every dependent cookbook from berkshelf
    def get_dependencies_from_lockfile
      berks_deps = berksfile.list
      deps = {}

      berks_deps.each {|dep| deps[dep.name] = dep.locked_version}
      deps
    end

    def berksfile_update(cookbooks = nil)
      cookbooks.nil? ? berksfile.update : berksfile.update(cookbooks)
      @dependencies = nil
    end
  end
end

This class is part of a larger "promote" gem that we use to promote cookbooks along our delivery pipeline. The class represents an individual cookbook and can be instantiated given an individual cookbook name and a config.

The config class (not shown here) is an extremely basic data class that encapsulates the properties of the chef environment.

Some assumptions

This code will only work given a couple conditions are true. One is fairly common and the other not so much. Even if these conditions do not exist in your environment you should be able to groc the logic.

Using Berkshelf

You are using berkshelf to manage your dependencies. This class uses the berkshelf API to find dependencies of other cookbooks.

Uning the Environment Cookbook Pattern

We use the environment cookbook pattern to govern which cookbooks are considered part of a deployable application. You can read more about this pattern here but an environment cookbook is very simple. It is just a metadata.rb and a Bekshelf.lock. The metadata file includes a depends for each of my cookbooks responsible for infrastructure used in the app. The Berkshelf.lock file uses that to build the entire tree of not only "my own" cookbooks but all third party cookbooks as well.

The dependency hash

For the purposes of this post, the dependency_hash method of the above class is the most crucial. This hash represents all cookbook dependencies and their version at the current point in time. If I were to persist this hash (more on that in a bit) and then later regenerate it with a hash that is at all different from the persisted one, I know something has changed. I don't really care where. Any change at all anywhere in a cookbook's dependency tree means it is at risk of breaking and should be tested.

The "Last Known Good" environment

Every time we test a cookbook and it passes, we store its current dependency hash in a special chef environment called LKG (last known good). This is really a synthetic environment with the same structure as any chef environment but the version constraints may look a bit strange:

{
  name: "LKG_platform",
  chef_type: "environment",
  json_class: "Chef::Environment",
  cookbook_versions: {
    p_rabbitmq: "1.1.15.2d88c011bfb691e9de0df593e3f25f1e5bf7cbba",
    p_elasticsearch: "1.1.79.2dd2472fbd47bf32db4ad83be45acb810b788a90",
    p_haproxy: "1.1.47.75dfc55b772fa0eb08b6fbdee02f40849f37ca77",
    p_win: "1.1.121.fa2a465e8b964a1c0752a8b59b4a5c1db9074ee5",
    p_couchbase: "1.1.92.325a02ab22743eccc17756539d3fac57ce5ef952",
    elasticsearch_sidec: "0.1.7.baab30d04181368cd7241483d9d86d2dffe730db",
    p_orchestration: "0.1.48.3de2368d660a4c7dfd6dba2bc2c61beeea337f70"
  }
}

We update this synthetic environment any time a cookbook test runs and passes. Here is a snippet of our Rakefile that does this:

cookbook = Promote::Cookbook.new(cookbook_name, config)
environment = Promote::EnvironmentFile.new("LKG", config)
lkg_versions = environment.cookbook_versions
cookbook_version = "#{cookbook.version.to_s}.#{hash}"

lkg_versions[cookbook_name] = cookbook_version

environment.write_cookbook_versions(lkg_versions)

This is just using the cookbook and environment classes of our Promote gem to write out the new versions and following this, the environment like all of our environments is committed to source control.

The dirty check

We have now seen how we generate a dependency hash and how we persist that on successful tests. Here is where we actually compare our current dependency hash and the last known hash. We need to determine before running integration tests on a given cookbook if it is any different than the last time we succesfully tested it. If we previously tested it and neither it nor any of its dependencies (no matter how deep) have changed, we really do not need or want to test it again.

def is_dirty(
  source_environment,
  lkg_environment,
  cookbook_name,
  dependency_hash = nil
)
  # Get the environment of the last good tests
  lkg_env_file = EnvironmentFile.new(lkg_environment, config)

  # if this cookbook is not in the last good tests, we are dirty
  unless lkg_env_file.cookbook_versions.has_key?(cookbook_name)
    return true
  end

  # get the current environment under test
  source_env_file = EnvironmentFile.new(source_environment, config)
  
  # get the version of this cookbook from the last good tests
  # it will look like 1.1.15.2d88c011bfb691e9de0df593e3f25f1e5bf7cbba
  # thi is the normal version followed by the dependency hash tested
  lkg_parts = lkg_env_file.cookbook_versions[cookbook_name].split('.')

  # Get the "plain" cookbook version
  lkg_v = lkg_parts[0..2].join('.')
  
  # the dependency hash last tested
  lkg_hash = target_parts[3]
  
  # we are dirty if our plain version or dependency hash has changed
  return source_env_file.cookbook_versions[cookbook_name] != lkg_v || 
    dependency_hash != lkg_hash
end

This method takes the environment currently under test, our LKG environment that has all the versions and their dependency hashes last tested, the cookbook to be tested and its current dependency hash. It returns whether or not our cookbook is dirty and needs testing.

Prior to calling this code, our build pipeline edits our "test" environment (the "source_environment" above). This environment holds the cookbook versions of all cookbooks being tested including the last batch of commits. We use a custom kitchen provisioner plugin to ensure Test-Kitchen uses these versions and not just the versions in each cookbooks berkshelf lock file. As I discussed in my dependency post from last year, its important that all changes are tested together.

So there you have it. Now Hurry up and Wait for your dirty cookbooks to run their tests!

Detecting a hung windows process by Matt Wrock

A couple years ago, I was adding remoting capabilities to boxstarter so that one could setup any machine in their network and not just their local environment. As part of this effort, I ran all MSI installs via a scheduled task because some installers (mainly from Microsoft) would pull down bits via windows update and that always fails when run via a remote session. Sure the vast majority of MSI installers will not need a scheduled task but it was just easier to run all through a scheduled task instead of maintaining a whitelist.

I have run across a couple installers that were not friendly to being run silently in the background. They would throw up some dialog prompting the user for input and then that installer would hang forever until I forcibly killed it. I wanted a way to automatically tell when "nothing" was happening as opposed to a installer actually chugging away doing stuff.

Its all in the memory usage

This has to be more than a simple timeout based solution  because there are legitimate installs that can take hours. One that jumps to mind rhymes with shequel shurver. We have to be able to tell if the process is actually doing anything without the luxury of human eyes staring at a progress bar. The solution I came up with and am going to demonstrate here has performed very reliable for me and uses a couple memory counters to track when a process is truly idle for long periods of time.

Before I go into further detail, lets look at the code that polls the current memory usage of a process:

function Get-ChildProcessMemoryUsage {
    param(
        $ID=$PID,
        [int]$res=0
    )
    Get-WmiObject -Class Win32_Process -Filter "ParentProcessID=$ID" | % { 
        if($_.ProcessID -ne $null) {
            $proc = Get-Process -ID $_.ProcessID -ErrorAction SilentlyContinue
            $res += $proc.PrivateMemorySize + $proc.WorkingSet
            $res += (Get-ChildProcessMemoryUsage $_.ProcessID $res)
        }
    }
    $res
}

Private Memory and Working Set

You will note that the memory usage "snapshot" I take is the sum of PrivateMemorySize and WorkingSet. Such a sum on its own really makes no sense so lets see what that means. Private memory is the total number of bytes that the process is occupying in windows and WorkingSet is the number of bytes that are not paged to disk. So WorkingSet will always be a subset of PrivateMemory, why add them?

I don't really care at all about determining how much memory the process is consuming in a single point in time. What I do care about is how much memory is ACTIVE relative to a point in time just before or after each of these snapshots. The idea is that if these numbers are changing, things are happening. I tried JUST looking at private memory and JUST looking at working set and I got lots of false positive hangs - the counts would remain the same over fairly long periods of time but the install was progressing.

So after experimenting with several different memory counters (there are a few others), I found that if I looked at BOTH of these counts together, I could more reliably use them to detect when a process is "stuck."

Watching the entire tree

You will notice in the above code that the Get-ChildProcessMemoryUsage function recursively tallies up the memory usage counts for the process in question and all child processes and their children, etc. This is because the initial installer process tracked by my program often launches one or more subprocesses that do various bits of work. If I only watch the initial root process, I will get false hang detections again because that process may do nothing for long periods of time as it waits on its child processes.

Measuring the gaps between changes

So we have seen how to get the individual snapshots of memory being used by a tree of processes. As stated before, these are only useful in relationships between other snapshots. If the snapshots fluctuate frequently, then we believe things are happening and we should wait. However if we get a long string where nothing happens, we have reason to believe we are stuck. The longer this string, the more likely our stuckness is a true hang.

Here are memory counts where memory counts of processes do not change:

VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 0
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 1
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 2

These are 3 snapshots of the child processes of the 2012 Sql Installer on an Azure instance being installed by Boxstarter via Powershell remoting. The memory usage is the same for all processes so if this persists, we are likely in a hung state.

So what is long enough?

Good question!

Having played and monitored this for quite some time, I have come up with 120 seconds as my threshold - thats 2 minutes for those not paying attention. I think quite often that number can be smaller but I am willing to error conservatively here. Here is the code that looks for a run of inactivity:

function Test-TaskTimeout($waitProc, $idleTimeout) {
    if($memUsageStack -eq $null){
        $script:memUsageStack=New-Object -TypeName System.Collections.Stack
    }
    if($idleTimeout -gt 0){
        $lastMemUsageCount=Get-ChildProcessMemoryUsage $waitProc.ID
        $memUsageStack.Push($lastMemUsageCount)
        if($lastMemUsageCount -eq 0 -or (($memUsageStack.ToArray() | ? { $_ -ne $lastMemUsageCount }) -ne $null)){
            $memUsageStack.Clear()
        }
        if($memUsageStack.Count -gt $idleTimeout){
            KillTree $waitProc.ID
            throw "TASK:`r`n$command`r`n`r`nIs likely in a hung state."
        }
    }
    Start-Sleep -Second 1
}

This creates a stack of these memory snapshots. If the last snapshot is identical to the one just captured, we add the one just captured to the stack. We keep doing this until one of two thing happens:

  1. A snapshot is captured that varies from the last one recorded in the stack. At this point we clear the stack and continue.
  2. The number of snapshots in the stack exceeds our limit. Here we throw an error - we believe we are hung.

I hope you have found this information to be helpful and I hope that none of your processes become hung!

Creating windows base images using Packer and Boxstarter by Matt Wrock

I've written a couple posts on how to create vagrant boxes for windows and how to try and get the base image as small as possible. These posts explain the multiple steps of preparing the base box and then the process of converting that box to the appropriate format of your preferred hypervisor. This can take hours. I cringe whenever I have to refresh my vagrant boxes. There is considerable time involved in each step: downloading the initial ISO from microsoft, installing the windows os, installing updates, cleaning up the image, converting the image to any alternate hypervisor formats (I always make both VirtualBox and Hyper-V images), compacting the image to the vagrant .box file, testing and uploading the images.

Even if all goes smoothly, the entire process can take around 8 hours. There is a lot of babysitting done along the way. Often there are small hiccups that require starting over or backtracking.

This post shows a way that ends much of this agony and adds considerable predictability to a successful outcome. The process can still be lengthy, but there is no baby sitting. Type a command, go to sleep and you wake up with new windows images ready for consumption. This post is a culmination of my previous posts on this topic but on steroids...wait...I mean a raw foods diet - its 100% automated and repeatable.

High level tool chain overview

Here is a synopsis of what this post will walk through:

  • Downloading free evaluation ISOs (180 day) of Windows
  • Using a Packer template To load the ISO in a VirtualBox VM and customize using a Windows Autounattend.xml file.
  • Optimize the image with Boxstarter, installing all windows updates, and shrinking as much as possible.
  • Output a vagrant .box file for creating new VirtualBox VMs with this image.
  • Sharing the box with others using Atlas.Hashicorp.com

tl;dr

If you want to quickly jump into things and forgo the rest of this post or just read it later, just read the bit about instaling packer or go right to their download page and then clone my packer-templates github repo. This has the packer template, and all the other artifacts needed to build a windows 2012 R2 VirtualBox based Vagrant box for windows. You can study this template and its supporting files to discover what all is involved.

Want Hyper-V?

For the 7 other folks out there that use Hyper-V to consume their devops tool artifacts, I recently blogged how to create a Hyper-V vagrant .box file from a Virtualbox hard disk (.vdi or vmdk). This post will focus of creating the VirtualBox box file, but you can read my earlier post to easily turn it into a Hyper-V box.

Say hi to Packer

Hi Packer.

In short, Packer is a tool that assists in the automation of machine images. Many are familiar with Vagrant, made by the same outfit - Hashicorp - which has become an extremely popular platform for creating VMs. Making the spinning up of VMs easy to do and easy to share has been huge. But there has remained a somewhat uncharted frontier - automating the creation of the images that make up the VM.

So many of today's automation tooling focuses on bringing a machine from bare OS to a known desired state. However there is still the challenge of obtaining a bare OS and perhaps one that is not so "bare". Rather than spending so many cycles building the base OS image up to desired state on each VM creation, why not just do this once and bake it into the base image?...oh wait...we used to do that and it sucked.

We created our "golden" images. We bowed down before them and worshiped at the alter of the instant environment. And then we forgot how we built the environment and we wept.

Just like Vagrant makes building a VM a source controlled artifact, Packer does the same for VM templates and like Vagrant, its built on a plugin architecture allowing for the creation of just about all the common formats including containers.

Installing Packer

Installing packer is super simple. Its just a collection of .exe files on windows and regardless of platform, its just an archive that needs to be extracted and then you simply add the extraction target to your path. If you are on Windows, do yourself a favor and install using Chocolatey.

choco install packer -y

Thats it. There should be nothing else needed. Especially since release 0.8.1 (the current release at the time of this post), Packer has everything needed to build windows images and no need for SSH.

Packer Templates

At the core of creating images using Packer is the packer template file. This is a single json file that is usually rather small. It orchestrates the entire image creation process which has three primary components:

  • Builders - a set of pluggable components that create the initial base image. Often this is just the bare installation media booted to a machine.
  • Provisioners - these plugins can attach to the built, minimal image above and bring it forward to a desired state.
  • Post-Processors - Components  that take the provisioned image and usually brings it to its final usable artifact. We will be using the vagrant post-processor here which converts the image to a vagrant box file.

Here is the template I am using to build my windows vagrant box:

{
    "builders": [{
    "type": "virtualbox-iso",
    "vboxmanage": [
      [ "modifyvm", "{{.Name}}", "--natpf1", "winrm,tcp,,55985,,5985" ],
      [ "modifyvm", "{{.Name}}", "--memory", "2048" ],
      [ "modifyvm", "{{.Name}}", "--cpus", "2" ]
    ],
    "guest_os_type": "Windows2012_64",
    "iso_url": "iso/9600.17050.WINBLUE_REFRESH.140317-1640_X64FRE_SERVER_EVAL_EN-US-IR3_SSS_X64FREE_EN-US_DV9.ISO",
    "iso_checksum": "5b5e08c490ad16b59b1d9fab0def883a",
    "iso_checksum_type": "md5",
    "communicator": "winrm",
    "winrm_username": "vagrant",
    "winrm_password": "vagrant",
    "winrm_port": "55985",
    "winrm_timeout": "5h",
    "guest_additions_mode": "disable",
    "shutdown_command": "C:/windows/system32/sysprep/sysprep.exe /generalize /oobe /unattend:C:/Windows/Panther/Unattend/unattend.xml /quiet /shutdown",
    "shutdown_timeout": "15m",
    "floppy_files": [
      "answer_files/2012_r2/Autounattend.xml",
      "scripts/postunattend.xml",
      "scripts/boxstarter.ps1",
      "scripts/package.ps1"
    ]
  }],
    "post-processors": [
    {
      "type": "vagrant",
      "keep_input_artifact": true,
      "output": "windows2012r2min-{{.Provider}}.box",
      "vagrantfile_template": "vagrantfile-windows.template"
    }
  ]
}

As stated in the tldr above, you can view the entire repository here.

Lets walk through this.

First you will notice the template includes a builder and a post-processor as mentioned above. It does not use a provisioner. Those are optional and we'll be doing most of our "provisioning" in the builder. I explain more about why later. Note that one can include multiple builders, provisioners, and post-processors. This one is pretty simple but you can find lots of more complex examples online.

  • type - This uses the virtualbox-iso builder which takes an ISO install file and produces VirtualBox .ovf and .vmdk files. (see packer documentation for information on the other built in builders).
  • vboxmanage - config sent directly to Virtualbox:
    • natpf1 - very helpful if building on a windows host where the winrm ports are already active. Allows you to define port forwarding to winrm port.
    • memory and cpus - these settings will speed things up a bit.
  • iso_url: The url of the iso file to load. This is the windows install media. I downloaded an eval version of server 2012 R2 (discussed below). This can be either an http url that points to the iso online or an absolute file path or file path relative to the current directory. I keep the iso file in an iso directory but because it is so large I have added that to my .gitignore.
  • iso_checksum and iso_checksum_type: These serve to validate the iso file contents. More on how to get these values later.
  • winrm values: as of version 0.8.0, packer comes with a winrm communicator. This means no need to install an SSH server. I use the vagrant username and password because this will be a vagrant box and those are the default credentials used by vagrant. Note that above in the vboxmanage settings I forward port 5985 to 55985 on the guest. 5985 is the default http winrm port so I need to specify 55985 as the winrm port I am using. The reason I am using a non default port is because the host I am using has winrm enabled and listens on 5985. If you are not on a windows host, you can probably just use the default port but that would conflict on my host. I specify a 5 hour timeout for winrm. This is the amount of time that packer will wait at most for winrm to be available. This is very important and I will discuss why later.
  • guest_additions_mode - By default, the virtualbox-iso builder will upload the latest virtualbox guest additions to the box. For my purposes I do not need this and it just adds extra time, takes more space and I have also had intermittent errors while the file is uploaded which hoses the entire build.
  • shutdown_command: This is the command to use to shutdown the machine. Different operating systems may require different commands. I am invoking sysprep which will shutdown the machine when it ends. Syprep when called with /generalize like I am doing here will strip the machine of security identifiers, machine name and other elements that that make it unique. This is particularly useful if you plan to use the image in an environment where many machines may be provisioned from this template and they need to interact with one another. Without doing this, all machines would have the same name and unique user SIDs which could cause problems especially in domain scenarios.
  • floppy_files: an array of files to be added to a floppy drive and made accessible to the machine. Here these include an answer file, and other files to be used throughout the image preparation.

Obtaining evaluation copies of windows

Many do not know that you can obtain free and fully functioning versions of all the latest versions of windows from Microsoft. These are "evaluation" copies and only last 180 days. However, if you only plan to use the images for testing purposes like I do, that should be just fine.

You can get these from  https://msdn.microsoft.com/en-us/evalcenter.aspx You will just need to regenerate the image at least every 180 days. You are not bound to purchase after 180 days but can easily just download a new ISO.

Finding the iso checksums

As shown above, you will need to provide the checksum and checksum type for the iso file you are using. You can do this using a utility called fciv.exe. You can install it from chocolatey:

choco install fciv -y

Now you can call fciv and pass it the path to any file and it will produce the checksum of that file in md5 format by default but you can pass a different format if desired.

Significance of winrm availability and the winrm_timeout

The basic flow of the virtualbox packer build is as follows:

  1. Boot the machine with the ISO
  2. Wait for winrm to be accessible
  3. As soon as it is accessible via winrm, shutdown the box
  4. start the provisioners
  5. when the provisioners are complete, run the post processors

In a perfect world I would have the windows install process enable winrm early on in my setup. This would result in a machine restart and then invoke provisioners that would perform all of my image bootstrap scripts. However, there are problems with that sequence on windows and the primary culprit is windows updates. I want to have all critical updates installed ASAP. However, windows updates cannot be easily run via a remote session which the provisioners do and once they complete, you will want to reboot the box which can cause the provisioners to fail.

Instead, I install all updates during the initial boot process. This allows me to freely reboot the machine as many times as I need since packer is just waiting for winrm and will not touch the box until that is available so I make sure not to enable winrm until the very end of my bootstrap scripts.

Also these scripts run locally so windows update installs can be performed without issue.

Bootstrapping the image with an answer file

Since we are feeding virtualbox an install iso, if we did nothing else, it would prompt the user for all of the typical windows setup options like locale, admin password, disk partition, etc. Obviously this is all meant to be scripted and unattended and that would just hang the install. This is what answer files are for. Windows uses answer files to automate the the setup process. There are all sorts of options one can provide to customize this process.

My answer file is located in my repo here. Note that it s named AutoUnattend.xml and added to the floppy drive of the booted machine. Windows will load any file named AutoUnattend.xml in the floppy drive and use that file as the answer file. I am not going to go through every line here but do know there are additional options beyond what I have that one can specify. I will cover some of the more important parts.

<UserAccounts>
  <AdministratorPassword>
    <Value>vagrant</Value>
    <PlainText>true</PlainText>
  </AdministratorPassword>
  <LocalAccounts>
      <LocalAccount wcm:action="add">
        <Password>
          <Value>vagrant</Value>
          <PlainText>true</PlainText>
        </Password>
        <Group>administrators</Group>
        <DisplayName>Vagrant</DisplayName>
        <Name>vagrant</Name>
        <Description>Vagrant User</Description>
      </LocalAccount>
    </LocalAccounts>
</UserAccounts>
<AutoLogon>
  <Password>
    <Value>vagrant</Value>
    <PlainText>true</PlainText>
  </Password>
  <Enabled>true</Enabled>
  <Username>vagrant</Username>
</AutoLogon>

This creates the administrator user vagrant with the password vagrant. This is the default vgrant username and password so setting up this admin user will make vagrant box setups easier. This also allows the initial boot to auto logon as this user instead of prompting.

<FirstLogonCommands>
  <SynchronousCommand wcm:action="add">
     <CommandLine>cmd.exe /c C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -File a:\boxstarter.ps1</CommandLine>
     <Order>1</Order>
  </SynchronousCommand>
</FirstLogonCommands>

You can specify multiple SynchronousCommand elements containing commands that should be run when the user very first logs in. I find it easier to read this already difficult to read file by just specifying one powershell file to run and then I'll have that file orchestrate the entire bootstrapping.

This file boxstarter.ps1 is another file in my scripts directory of the repo that I add to the virtualbox floppy. We will look closely at that file in just a bit.

<settings pass="specialize">
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-ServerManager-SvrMgrNc" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
    <DoNotOpenServerManagerAtLogon>true</DoNotOpenServerManagerAtLogon>
  </component>
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-IE-ESC" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
    <IEHardenAdmin>false</IEHardenAdmin>
    <IEHardenUser>false</IEHardenUser>
  </component>
  <component xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="Microsoft-Windows-OutOfBoxExperience" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
  <doNotOpenInitialConfigurationTasksAtLogon>true</DoNotOpenInitialConfigurationTasksAtLogon>
  </component>
</settings>

In short, this customizes the image in such a way that will make it far less likely to cause you to kill yourself or others while actually using the image. So you are totally gonna want to include this.

This prevents the "server manager" from opening on startup and allows IE to actually open web pages without the need to ceremoniously click thousands of times.

A boxstarter bootstrapper

So given the above answer file, windows will install and reboot and then the vagrant user will auto logon. Then a powershell session will invoke boxstarter.ps1. Here it is:

$WinlogonPath = "HKLM:\Software\Microsoft\Windows NT\CurrentVersion\Winlogon"
Remove-ItemProperty -Path $WinlogonPath -Name AutoAdminLogon
Remove-ItemProperty -Path $WinlogonPath -Name DefaultUserName

iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/mwrock/boxstarter/master/BuildScripts/bootstrapper.ps1'))
Get-Boxstarter -Force

$secpasswd = ConvertTo-SecureString "vagrant" -AsPlainText -Force
$cred = New-Object System.Management.Automation.PSCredential ("vagrant", $secpasswd)

Import-Module $env:appdata\boxstarter\boxstarter.chocolatey\boxstarter.chocolatey.psd1
Install-BoxstarterPackage -PackageName a:\package.ps1 -Credential $cred

This downloads the latest version of boxstarter and installs the boxstarter powershell modules and finally installs package.ps1 via a boxstarter install run. You can visit boxstarter.org for more information regarding all the details of what boxstarter does. The key features is that it can run a powershell script, and handle machine reboots by making sure the user is automatically logged back in and that the script (package.ps1 here) is restarted.

Boxstarter also exposes many commands that can tweak the windows UI, enable/disable certain windows options and also install windows updates.

Note the winlogon registry edit at the beginning of the boxstarter bootstraper. Without this, boxstarter will not turn off the autologin when it completes. This is only necessary if running boxstarter from a autologined session like this one. Boxstarter takes note of the current autologin settings before it begins and restores those once it finishes. So in this unique case it would restore to a autologin state.

Here is package.ps1, the meat of our bootstrapping:

Enable-RemoteDesktop
Set-NetFirewallRule -Name RemoteDesktop-UserMode-In-TCP -Enabled True

Write-BoxstarterMessage "Removing page file"
$pageFileMemoryKey = "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
Set-ItemProperty -Path $pageFileMemoryKey -Name PagingFiles -Value ""

Update-ExecutionPolicy -Policy Unrestricted

Write-BoxstarterMessage "Removing unused features..."
Remove-WindowsFeature -Name 'Powershell-ISE'
Get-WindowsFeature | 
? { $_.InstallState -eq 'Available' } | 
Uninstall-WindowsFeature -Remove

Install-WindowsUpdate -AcceptEula
if(Test-PendingReboot){ Invoke-Reboot }

Write-BoxstarterMessage "Cleaning SxS..."
Dism.exe /online /Cleanup-Image /StartComponentCleanup /ResetBase

@(
    "$env:localappdata\Nuget",
    "$env:localappdata\temp\*",
    "$env:windir\logs",
    "$env:windir\panther",
    "$env:windir\temp\*",
    "$env:windir\winsxs\manifestcache"
) | % {
        if(Test-Path $_) {
            Write-BoxstarterMessage "Removing $_"
            Takeown /d Y /R /f $_
            Icacls $_ /GRANT:r administrators:F /T /c /q  2>&1 | Out-Null
            Remove-Item $_ -Recurse -Force -ErrorAction SilentlyContinue | Out-Null
        }
    }

Write-BoxstarterMessage "defragging..."
Optimize-Volume -DriveLetter C

Write-BoxstarterMessage "0ing out empty space..."
wget http://download.sysinternals.com/files/SDelete.zip -OutFile sdelete.zip
[System.Reflection.Assembly]::LoadWithPartialName("System.IO.Compression.FileSystem")
[System.IO.Compression.ZipFile]::ExtractToDirectory("sdelete.zip", ".") 
./sdelete.exe /accepteula -z c:

mkdir C:\Windows\Panther\Unattend
copy-item a:\postunattend.xml C:\Windows\Panther\Unattend\unattend.xml

Write-BoxstarterMessage "Recreate [agefile after sysprep"
$System = GWMI Win32_ComputerSystem -EnableAllPrivileges
$System.AutomaticManagedPagefile = $true
$System.Put()

Write-BoxstarterMessage "Setting up winrm"
Set-NetFirewallRule -Name WINRM-HTTP-In-TCP-PUBLIC -RemoteAddress Any
Enable-WSManCredSSP -Force -Role Server

Enable-PSRemoting -Force -SkipNetworkProfileCheck
winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'

The goals of this script is to fully patch the image and then to get it as small as possible. Here is the breakdown:

  • Enable Remote Desktop. We do this with a bit of shame but so be it.
  • Remove the page file. This frees up about a GB of space. At the tail end of the script we turn it back on which means the page file will restore itself the first time this image is run after this entire build.
  • Update the powershell execution policy to unrestricted because who likes restrictions? You can set this to what you are comfortable with in your environment but if you do nothing, powershell can be painful.
  • Remove all windows features that are not enabled. This is a new feature in 2012 R2 called feature on demand and can save considerable space.
  • Install all critical windows updates. There are about 118 at the time of this writing.
  • Restart the machine if reboots are pending and the first time this runs, they will be.
  • Run the DISM cleanup that cleans the WinSxS folder of rollback files for all the installed updates. Again this is new in 2012 R2 and can save quite a bit of space. Warning: it also takes a long time but not nearly as long as the updates themselves.
  • Remove some of the random cruft. This is not a lot but why not get rid of what we can?
  • Defragment the hard drive and 0 out empty space. This will allow the final act of compression to do a much better job compressing the disk.
  • Lastly, and it is important this is last, enable winrm. Remember that once winrm is accessible, packer will run the shutdown command and in our case here that is the sysprep command:
C:/windows/system32/sysprep/sysprep.exe /generalize /oobe /unattend:C:/Windows/Panther/Unattend/unattend.xml /quiet /shutdown

This will cause a second answer file to fire after the machine boots next. That will not happen until after this image is built, likely just after a "vagrant up" of the image. That file can be found here and its much smaller than the initial answer file that drove the windows install. This second unattend file mainly ensures that the user will not have to reset the admin password at initial startup.

Packaging the vagrant file

So now we have an image and the builders job is done. On my machine this all takes just under 5 hours. Your mileage may vary but make sure that your winrm_timeout is set appropriately. Otherwise if the timeout is less that the length of the build, the entire build will fail and be forever lost.

The vagrant post-processor is what generates the vagrant box. It takes the artifacts of the builder and packges them up. There are other post-processors available, you can string several together and you can create your own custom post-processors. You can make your own provisioners and builders too.

One post processor worth looking at is the atlas post-processor which you can add after the vagrant post processor and it will upload your vagrant box to atlas and now you can share it with others.

Next steps

I have just gotten to the point where this is all working. So this is rough around the edges but it does work. There are several improvements to be made here:

  • Make a boxstarter provisioner for packer. This could run after the initial os install in a provisioner and AFTER winrm is enabled. I would have to leverage boxstarter's remoting functionality so that it does not fail when interacting with windows updates over a winrm session. One key advantage is that the boxstarter output would bubble up to the console running packer giving much more visibility to what is happening with the build as it is running. As it stands now, all output can only be seen in the virtualbox GUI which will not appear in headless environments like in an Atlas build.
  • As stated at the beginning of this post, I create both virtualbox and Hyper-V images. The Hyper-V conversion could use its own post-processor. Now I simply run this in a separate powershell command.
  • Use variables to make the scripts reusable. I'm just generating a single box now but I will definitely be generating more: client SKUs, server core, nano, etc. Packer allows you to specify variables and better templatize the build.

Thanks Matt Fellows and Dylan Meissner

I started looking into all of this a couple months ago and then got sidetracked until last week. In my initial investigation in may, packer 0.8.1 was not released and there was no winrm support out of the box. Matt and Dylan both have contributed to those efforts and also provided really helpful pointers as I was having issues getting things working right.