Troubleshooting Ruby hangs on windows with windbg / by Matt Wrock

Ahhh windbg...those who are familiar with it know that you can't live with it and you can't live without it. Unless you are a hardened windows c++ dev (I am not), if you have used windbg, it was in a moment of desperation when all of the familiar tools just wouldn't cut it. So now whenever you hear the word "windbg", it conjours up the archetype of hard to use, arcane, engineering tooling. Step aside git and vi, windbg kicks your ass when it comes to commands that were intended to be forgotten. Yet here's the thing, windbg has saved your but. It allowed you to see things nothing else could show you. Did git or vi provide this value?...ok...maybe they did...but still!!

Many of us have personal windbg stories. These usually involve some horrendous production outage and are mixed with the drama of figuring out a tool that was not meant to just be picked up and mastered. I have several of these stories in my high volume web development days. Here are a couple:

  1. A frequently called function uses a new regex causing CPU to spike accross all web nodes.
  2. Web site becomes unresponsive because the asp.net thread pool is saturated with calls to a third party web service that is experiencing latency issues.

I survived these events and others thanks to windbg. Thanks windbg! But I still hate you and I think I always will.

windbg and ruby?

So why am I talking about windbg and ruby in the same post? Are they not from two different worlds that should remain apart? Well I have now had two separate incidents in the past few months where I have needed to use windbg with a problem in ruby code. Both of these incidents are similar in that they both differ from my previous run ins with windbg where I was debugging managed code and they both were related to a similar Chef related problem.

When it comes to ruby, its all unmanaged code and you may not know the difference between managed and unmanaged code in the windows world and that is totally ok, but if you do, I can tell you there are extensions you can use with windbg (like sos.dll) to make your life easier. Sorry, those do not apply here. Its all native from here on out!

I've blogged long ago about debugging native .net code with windbg, bit one, thats a super old post and two, I had to discover new "go to" commands for dealing with unmanaged memory dumps.

Lets talk like we are 2

As in two years old. First, two year olds are adorable. But the real reason is that's pretty much the level where I operate and whatever windbg maturity I gained over the past few days will be lost the next time I need it. I'm writing this post to my future self who has fewer fast twitching brain cells that I have now and is in a bind and needs a clear explanation of how to navigate in a sea of memory addresses.

So to those who are more seasoned, this may not be for you and you may know a bunch more tricks to make this all simpler and for you god created commenting engines.

The problem

So here is the scenario. You notice that your windows nodes under Chef management are not converging. Maybe, they are supposed to converge every few minutes but they have not converged for hours.

So you figure something has happened to cause my nodes to fail and expect to find chef client logs showing chef run after run with the same failures. Nothing new, you will debug the error and fix it. So you open the log and see the latest chef run has just started a new run. Then you look at the time stamp and notice it started hours ago but never moved past finding its runlist. Its just sitting there hung.

No error message to troubleshoot. There is something wrong but the data stops there. What do you do?

Two words...windbg my friend...windbg. Ok its not two words but kind of has a two word ring to it.

What is this windbg?

Windbg is many things really. At its core, its simply a debugger and many use it to attach to live processes and step through execution. That's not usually a good idea when troubleshooting a multithreaded application like a web site but may not be bad for a chef run. However, I have never used it in this manner.

Another very popular usage is to take a snapshot of a process, also called a memory dump, and use it to deeply examine exactly what was going on in the system at that point in time.

The great thing is that this snapshot is very complete. It has access to all memory in the process, all threads and all stack traces. However the rub is that it is very raw data. Its just raw memory, a bunch of addresses pointers and hex values that may cause more confusion than help.

There are several commands to help sort out the sea of data but its likely far less familiar, intuitive or efficient than your day to day dev tools.

This is one reason why I write this post and why I wrote my last windbg post, I can never remember this stuff and the act of committing my immediate memory to writing and having a permanent record of this learning will help me when I inevitably have another similar problem in the future.

Taking a dump

What?!

Oh...god...no.

Thats really what we call this. Seriously. I sit with straight faced, well paid, professional adults and ask them to take dumps and give them to me.

Oh stop it.

Seriously though, this is the first stumbling point in the debugging endeavor. There are several kinds of memory dumps (crash dumps, hang dumps, minidumps, user mode dumps, kernel dumps, etc), each have their merits and there are different ways to obtain them and some means are more obscure than others.

For debugging ruby hangs, we generally just need a user mode dump of the ruby.exe process. This is not going to be a thorough discussion on all the different types of dumps and the means to produce them but I will cover a couple options.

Task Scheduler

In recent versions of windows, they come equipped with a crash dump generation tool that anyone can easily access. Simply right click on the process you want to examine and then select "create dump file", this generates a user mode dump file of the selected process. There are a couple downsides to collecting dumps in this manner:

1. These dumps do not include the handle table and therefore any use of the !handle command in windbg will fail.

2. On a 64 bit machine, unless you explicitly invoke the 32 bit task scheduler, you will get a 64 bit dump even of 32 bit processes. This is not a big deal really and we'll talk about how to switch to 32 bit mode in a 64 bit dump.

ProcDump

There is a sysinternals tool, ProcDump, that can be used to generate a dump. This tool allows you to set all kinds of thresholds in order to capture a dump file at just the right time based on CPU or memory pressure as well as other factors. You can also simply capture a dump immediately. I typically run:

procdump -ma <PID> <path of dump file>

So in the event of analyzing a ruby process dump, I simply give it the process id of ruby.exe. This will include the handle table and is smart enough to produce a 32 bit dump for 32 bit processes.

There are plenty of other ways to get dumps. I used to use AdPlus.vbs years ago and I *think* that will still work and ships with the debugging tools for windows that also includes windbg. However you get your dump file, it will end in a .dmp extension. If supplying a output path to procdump or other generator, make sure you use that extension when naming the file.

When to take the dump

In this post we are discussing hang scenarios. So when analyzing hangs, take the dump when your process is hanging. If we were analyaing a CPU pinning scenario, we would take the dump when the process was pinning the cpu. This can be easier said than done especially in production environments where hangs and pins may not surface on demand and may take hours or days to occur after a process starts. If you find yourself in such a situation where its tricky to generate the dump file at just the right time, have a look at the various procdump switches that can wait for just the right moment based on several different triggers.

Simulating high memory pressure

In the scenario described here, one of the hang events was preceded by high memory pressure on the chef node as seen from various flavors of OutOfMemoryExceptions on preceding chef runs. To reproduce the scenario, I needed to create an environment where there was little memory available on the box. For the benefit of my future self and others, here is how I did that:

  1. Created a VM with relatively low memory. 512MB was adequate in my case.
  2. Shrunk the paging file down to about 128MB
  3. Used the sysinternals tool, TestLimit to artificially leak memory on the box running: TestLimit64.exe -d 1 -c

At this point based on analyzing previous dumps, I knew exactly where in the chef run the problem was occurring so I just created a little ruby program to run the problem call in a loop until the ruby process hanged.

Getting windbg

Windbg is included in Microsoft's debugging toolset which is included in the Windows SDK. From the microsoft download site, you can download the entire SDK or choose only the debugging tools.

Another option is chocolatey. I maintain the windbg chocolatey package and this package will download and install the windows debugging tools and does a few extra helpful things:

  1. Adds a microsoft symbol server path environment variable that windbg will pick up. Assigning the path symsrv*symsrv.dll*f:\localsymbols*http://msdl.microsoft.com/download/symbols to the _NT_SYMBOL_PATH variable. This is important if you want to be able to make much sense of the dump files. Without symbols, you are left to raw memory addresses with no specific information on the structure and calling patterns of key classes and structures in the call stack.
  2. Copies the sos.dll from your local .net installation to the same folder where windbg is installed. This comes in handy for managed code debugging not covered here.
  3. Adds separate shortcut icons for the 64 bit and 32 bit versions of windbg. It can be important depending on the type of process dumped, to use the corresponding version of windbg. It really does not mater so much for this post.

Simply run:

choco install windbg -y

to install the chocolatey windbg package. I do this on all of my dev machines.

Analyzing dump files

So now you have the tools, windbg, you need and a dump file of a problem process to analyze. The remainder of this post will cover how to make sense of the dump with windbg. I will by no means be covering all the commands possible. In fact I'm just gonna cover a handful which should be all you need but I'd encourage anyone to at least browse over this helpful command summary since there very well may be a command that is better suited for your scenario. I'd also point out that I am NOT a windbg guru and if you see a command or strategy I am missing here that you think would be helpful, please provide a comment.

Loading the dump file

Once windbg is open, navigate to File -> Open Crash Dump and browse to the .DMP file you captured. I know...your process may not have actually "crashed", but this is the correct menu option.

So now your dump is loaded into windbg and the entire memory heap of the process is at your fingertips. Here is the typical path I walk at this point under unmanaged scenarios like a ruby hang.

Which thread do I care about?

Chances are nearly certain that there is more than a single thread captured in the dump and in fact depending on the nature of the app, there can be quite a few. However it is likely you only care about one. At least that was the case in my scenario. I want to examine the thread that is hung.

I'll start by running !runaway. Not because thats the thing I want to do most at this time. Rather this will list all threads and how much user time they have consumed:

0:000> !runaway
 User Mode Time
  Thread       Time
   0:734       0 days 0:20:11.843
   1:94        0 days 0:00:00.078

Here there are only two threads. I am more interested in the one that was running for 20 minutes: thread 0. Now the times displayed here may be lower, perhaps dramatically lower than you expect. My process had been running for a couple hours, what's up with 20 minutes!? Its even possible that your process has been hung for days, and the time displayed here will be a mere milliseconds.

Its important to note that this is "User Time" and is the amount of time that the thread was actually working. To get actual "elapsed time" you can add a parameter to !runaway. I'll run !runaway 7 to get user, kernel and elapsed time:

0:000> !runaway 7
 User Mode Time
  Thread       Time
   0:734       0 days 0:20:11.843
   1:94        0 days 0:00:00.078
 Kernel Mode Time
  Thread       Time
   0:734       0 days 0:13:19.437
   1:94        0 days 0:00:00.031
 Elapsed Time
  Thread       Time
   0:734       0 days 2:58:53.443
   1:94        0 days 2:58:53.412

This is from the dump where I artificially created the high memory pressure environment and the same ruby thread was looping for hours. Of the 3 hours it was looping, the thread itself was only doing work for 20 minutes. Some of the production dumps I looked at had a much higher elapsed time (days) and lower user time (just milliseconds) because Chef had started a ruby process that immediately invoked a blocking call to an external process and then waited for days, during which time it did no work. Lazy bum.

Call stacks

Now lets look at some call stacks. Even though I'm pretty sure at this point that thread 0 is the one I'm interested in, I'll fist browse the stack of all threads using: ~*k

0:000> ~*k

.  0  Id: 90c.734 Suspend: 0 Teb: 7ffdd000 Unfrozen
ChildEBP RetAddr  
0028ea2c 7746112f ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c combase!CCliModalLoop::BlockFn+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 combase!ThreadSendReceive+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d combase!CRpcChannelBuffer::SendReceive2+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d combase!ClassicSTAThreadSendReceive+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 combase!CCtxComChnl::SendReceive+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 combase!NdrExtpProxySendReceive+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f combase!ObjectStublessClient+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 combase!ObjectStubless+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e wbemdisp!CDispatchHelp::Invoke+0xaf
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for win32ole.so - 
0028f7b8 6ab901e2 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde win32ole+0x101e2
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for msvcrt-ruby200.dll - 
0028f988 668e3539 win32ole+0x10bde
0028fa18 668ef3f8 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d msvcrt_ruby200!rb_check_copyable+0x37d1
*** ERROR: Module load completed but symbols could not be loaded for ruby.exe
0028fe88 0040287f msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa ruby+0x287f
0028ff80 77087c04 ruby+0x13fa
0028ff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

   1  Id: 90c.94 Suspend: 0 Teb: 7ffda000 Unfrozen
ChildEBP RetAddr  
022cfec8 77452cc7 ntdll!NtWaitForSingleObject+0xc
022cff3c 77452c02 KERNELBASE!WaitForSingleObjectEx+0x99
022cff50 668fe676 KERNELBASE!WaitForSingleObject+0x12
WARNING: Stack unwind information not available. Following frames may be wrong.
022cff80 77087c04 msvcrt_ruby200!rb_thread_list+0x10d6
022cff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
022cffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
022cffec 00000000 ntdll!_RtlUserThreadStart+0x1b

OK, so the ~*k semantics were probably obvious but remember we are 2. Lets walk through each character:

~ - Typing this alone would have simply listed 2 lines: each thread and their status.

* - This returned all threads. We could have supplied a number instead to get a single thread like ~0k to get just thread 0.

k - this produced the call stack.

There are more single character commands we can use after the 'k' each equally intuitive and discoverable. We will look at a couple in just a bit.

So here we have call stacks for each thread with a line for each frame in the stack. One can kind of groc the gist of the execution here: a bunch of ruby C calls and then an RPC call is sent and it waits for that call to return. The two addresses at the front of each frame are likely not much help for us. They refer to the base pointer of the frame and the return address where execution is to resume.

So now you are dissapointed. You were probably thinking you'd be seeing an actual ruby stack trace. Yeah. That would be cool. So one challenge we face is finding what actual ruby code, perhaps that we wrote, is at play here. However, if you find yourself needing to debug at this level, there are things happening outside of ruby that are of real interest.

Another thing to note is that the ruby frames seem brief compared to the COM frames above. Later we'll see how examining the COM is much easier than ruby. This is because we lack symbols for the compiled ruby runtime. I have not tried it myself, but we could compile ruby ourselves and prooduce symbols then capture a dump running on that ruby. This would likely be helpful but not practical in a production environment.

Switching to the correct bitness

Maybe you are seeing nothing like this at all. If you are seeing something like this:

  0  Id: 90c.734 Suspend: 0 Teb: 00000000`7ffdb000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`0008ee08 00000000`77581e66 wow64cpu!CpupSyscallStub+0x2
00000000`0008ee10 00000000`7754219a wow64cpu!WaitForMultipleObjects32+0x1d
00000000`0008eec0 00000000`775420d2 wow64!RunCpuSimulation+0xa
00000000`0008ef10 00007ffc`7f383a15 wow64!Wow64LdrpInitialize+0x172
00000000`0008f450 00007ffc`7f362f1e ntdll!LdrpInitializeProcess+0x1591
00000000`0008f770 00007ffc`7f2d8ebe ntdll!_LdrpInitialize+0x8a00e
00000000`0008f7e0 00000000`00000000 ntdll!LdrInitializeThunk+0xe

This is what a 32 bit thread looks like in a 64 bit dump. Thats ok. You can switch inside of this same dump to 32 bit mode by running:

!wow64exts.sw

Now run ~0k and you will see a stack like the larger one before.

Going deeper into the stack

So I'll be honest, the ruby frames here make no sense to me. This may be due to the lack of symbols but the last ruby frame would indicate that an argument error is being thrown. However it is called by rb_f_send which does not call rb_error_arity according to the ruby source and rb_error_arity does not call into win32ole. So I don't know whats going on here. Now I do know that the ruby win32ole class manages WMI queries and I also know that wbem is the prefix of the wmi scripting library in windows. Further, WMI queries are handled out of process which would explain the RPC call. In this case, the hang is occuring in a Chef client run but before the cookbook recipes actually run. This is when chef "ohai" data is collected which is the data that describes the current state of a machine and windows nodes leverage WMI for much of that data.

So is there any more info that might lend us some clues here? Well one step we can take is to query the thread 0 stack again and this time try to get the actual parameters being passed into the function calls. Thats done by adding...wait...I wont tell...you guess...

...thats right! Of course...'b'. We add b to the ~0k command so we run ~0kb:

0:000> ~0kb
ChildEBP RetAddr  Args to Child              
0028ea2c 7746112f 00000002 0028ebe0 00000001 ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 00000002 0028ebe0 00000000 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 00000001 0028ec80 ffffffff user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c 0028ec80 00000001 0028ec7c combase!CCliModalLoop::BlockFn+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- -------- -------- -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- -------- -------- -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 0028ef84 005c9c54 00505170 combase!ThreadSendReceive+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- -------- -------- -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d 005c9c54 0028ef84 0028ef4c combase!CRpcChannelBuffer::SendReceive2+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- -------- -------- -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- -------- -------- -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d 005c9c54 0028ef84 0028ef4c combase!ClassicSTAThreadSendReceive+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- -------- -------- -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 005c9c54 0028ef84 0028ef4c combase!CCtxComChnl::SendReceive+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 02b0f2c4 0028efb0 76dc5740 combase!NdrExtpProxySendReceive+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b 29b71661 76f113e0 0028f3d0 rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e 74ba39b0 74ba4ce0 0028f3d0 rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f 0028f3d0 00000003 0028f440 combase!ObjectStublessClient+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 02b0f2c4 005bc634 ffffffff combase!ObjectStubless+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f 005bc690 ffffffff 00000001 fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 02b7b668 0028f4f8 0051197c wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 02b7b668 00000024 00000004 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f 004f54ec 02b7b668 00000001 oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e 02b7b678 00000001 6ab991f4 wbemdisp!CDispatchHelp::Invoke+0xaf
0028f7b8 6ab901e2 02b7b668 00000001 6ab991f4 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde 00000003 00000000 01dc6590 win32ole+0x101e2
0028f988 668e3539 00000001 00433098 02386500 win32ole+0x10bde
0028fa18 668ef3f8 0262556e 02487ce8 00000015 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 0043308c 00433088 00000004 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd 00375750 004b2f30 02419480 msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 01d5af70 00000004 02487ed5 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 00000001 00000000 01dda1b8 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 02522f58 00000000 ffffffff msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d 0028fe7c 00000000 0028fe88 msvcrt_ruby200!rb_check_copyable+0x37d1
0028fe88 0040287f 02522f58 00372cb8 0028ff80 msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa 00000002 00372cb8 00371550 ruby+0x287f
0028ff80 77087c04 7ffde000 77087be0 28433cec ruby+0x13fa
0028ff94 7765ad1f 7ffde000 28241b9c 00000000 kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ffffffff 7764024a 00000000 ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 004014e0 7ffde000 00000000 ntdll!_RtlUserThreadStart+0x1b

This adds three more memory addresses to each frame and they point to the first three arguments passed to the function on the stack.

So these may be hex representations of actual values of pointers to other data structures. If we have symbols for the code being called, things get alot easier but lets look at a ruby call first. I'd like to check out the args passed to the last ruby call in the stack.

Examining raw address values

So the command to display memory in windbg is 'd'. I know, thats really kind of a given but remember we are 2. Also, its not just d. d is always accompanied by another character to let windbg know how to display that value. Here are some I find myself using alot:

  • dd - DWORD (4bytes) by default it spits out 32 of them but you can limit the number with the L parameter (ex. dd 7746112f L2)
  • dc - Same output as dd but adds ascii representation to the right
  • da - displays the address memory as a ascii string
  • du - displays the memory as a unicode string

There is another one, dt, that is super helpful and we will look at that in just a bit.

So one thing I am curious about here are the arguments passed to the last ruby frame:

0028fa18 668ef3f8 0262556e 02487ce8 00000015 msvcrt_ruby200!rb_error_arity+0x129

Remember, its the 2nd - 4th addresses that contain the first three arguments. You always see 3 addresses regardless of the number of arguments the method takes so these may point to nowhere useful. Looking at the first argument dc 0262556e produces:

0:000> dc 0262556e 
0262556e  336e6977 6e705f32 67697370 6464656e  win32_pnpsignedd
0262557e  65766972 26340072 5f564552 335c3130  river.4&REV_01\3
0262558e  37363226 36313641 26332641 00003833  &267A616A&3&38..
0262559e  3d740241 1b8612c8 43508000 45565c49  A.t=......PCI\VE
026255ae  30385f4e 44263638 375f5645 26323931  N_8086&DEV_7192&
026255be  53425553 305f5359 30303030 26303030  SUBSYS_00000000&
026255ce  5f564552 335c3330 37363226 36313641  REV_03\3&267A616
026255de  26332641 00003030 3d7d0000 1c011331  A&3&00....}=1...

OK, this is interesting and curious. win32_pnpsigneddriver is a valid WMI class. However, this would not be a valid argument for rb_error_arity which takes 3 ints. So now I'm just dismissing the entire ruby stack to lack of debug symbols, but the I feel confident a WMI call is a factor. Here is another clue: the last argument on the second win32ole frame:

0:000> dc 02386500 
02386500  0000000c 02361db0 00000000 6ab834f0  ......6......4.j
02386510  0241c9d8 00000000 09402005 01d5a820  ..A...... @. ...
02386520  00000023 02571888 00000023 00000000  #.....W.#.......
02386530  0000a007 01d528a0 02386560 00000000  .....(..`e8.....
02386540  00000000 00000000 09424005 01d5a820  .........@B. ...
02386550  63657845 72657551 00000079 00000000  ExecQuery.......
02386560  00502005 01d5a820 00000023 02625560  . P. ...#...`Ub.
02386570  0000003e 00000000 08202005 01d5a820  >........  . ...

ExecQuery is the WMI method name that performs queries on WMI classes.

So searching the chef ohai gem I come accross:

drivers = wmi.instances_of('Win32_PnPSignedDriver')

This instances_of method uses the wmi-lite gem calling ExecQuery on the locator returned from:

locator = WIN32OLE.new("WbemScripting.SWbemLocator")

So now I have a pretty good idea what application code is triggering the hang. As I stated earlier in this post, WMI queries create an RPC call to a process named WmiPrvSE.exe. Its not uncommon to see more than one of these processes on a machine. So now I'm wondering if it might be possible to track which of these processes this thread is waiting on. This gets us to a couple other useful windbg commands.

Navigating types

So we do have symbols for the COM code handling the RPC message. That being the case, there is another stack trace command we can use to get us the parameters in a more friendly way - ~0kP

0:000> ~0kP
ChildEBP RetAddr  
0028ea2c 7746112f ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c combase!CCliModalLoop::BlockFn(
            void ** ahEvent = 0x0028ec80, 
            unsigned long cEvents = 1, 
            unsigned long * lpdwSignaled = 0x0028ec7c)+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 combase!ThreadSendReceive(
            class CMessageCall * pCall = 0x00505170, 
            struct _GUID * rmoid = 0x00509fb4 {70622196-bea4-9302-ac70-cabc069d1e23})+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d combase!CRpcChannelBuffer::SendReceive2(
            struct tagRPCOLEMESSAGE * pMessage = 0x0028ef84, 
            unsigned long * pstatus = 0x0028ef4c)+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d combase!ClassicSTAThreadSendReceive(
            class CAptRpcChnl * pChannel = 0x005c9c54, 
            struct tagRPCOLEMESSAGE * pMsg = 0x0028ef84, 
            unsigned long * pulStatus = 0x0028ef4c)+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 combase!CCtxComChnl::SendReceive(
            struct tagRPCOLEMESSAGE * pMessage = 0x0028ef84, 
            unsigned long * pulStatus = 0x0028ef4c)+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 combase!NdrExtpProxySendReceive(
            void * pThis = 0x02b0f2c4, 
            struct _MIDL_STUB_MESSAGE * pStubMsg = 0x0028efb0)+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f combase!ObjectStublessClient(
            void * ParamAddress = 0x0028f3d0, 
            long Method = 0n3)+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 combase!ObjectStubless(void)+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e wbemdisp!CDispatchHelp::Invoke+0xaf
0028f7b8 6ab901e2 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde win32ole+0x101e2
0028f988 668e3539 win32ole+0x10bde
0028fa18 668ef3f8 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d msvcrt_ruby200!rb_check_copyable+0x37d1
0028fe88 0040287f msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa ruby+0x287f
0028ff80 77087c04 ruby+0x13fa
0028ff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

Maybe still not the most beautiful output, but look at those combase calls. We have types and parameter names! Amazing! So to get the process ID of the service handling the call, we want to look at the CMessageCall passed to ThreadSendReceive. We can dig into CMessageCall using the dt command. It takes the symbol to inspect and its address:

0:000> dt CMessageCall 0x00505170
Ambiguous matches found for CMessageCall (dumping largest sized):
    combase!CMessageCall {0x004 bytes}
    combase!CMessageCall {0x100 bytes}
combase!CMessageCall
   +0x000 __VFN_table : 0x76efb4c0 
   =7703e09c s_uCurrentCallTraceId : 0x23ac7b
   +0x004 __VFN_table : 0x76efb53c 
   +0x008 __VFN_table : 0x76efb52c 
   +0x00c _pNextMessage    : (null) 
   =76ef0000 Kind             : 0x905a4d (No matching name)
   +0x010 _callcat         : 1 ( CALLCAT_SYNCHRONOUS )
   +0x014 _iFlags          : 0x202
   +0x018 _hResult         : 0n0
   +0x01c _hEvent          : 0x00000260 Void
   +0x020 _hWndCaller      : (null) 
   +0x024 _ipid            : _GUID {0009042d-030c-0000-5ed7-a7ed59ca66b0}
   +0x034 _hSxsActCtx      : 0xffffffff Void
   +0x038 _server_fault    : 0
   +0x03c _destObj         : CDestObject
   +0x048 _pHeader         : 0x005bfdb8 Void
   +0x04c _pLocalThis      : 0x005bfdd8 WireLocalThis
   +0x050 _pLocalThat      : (null) 
   +0x054 _pHandle         : 0x005941a0 CChannelHandle
   +0x058 _hRpc            : 0x02b68f18 Void
   +0x05c _pContext        : (null) 
   +0x060 _dwCOMHeaderSize : 0x30
   +0x064 message          : tagRPCOLEMESSAGE
   +0x090 _iidWire         : _GUID {423ec01e-2e35-11d2-b604-00104b703efd}
   +0x0a0 _guidCausality   : _GUID {1a786d47-0267-499f-b9e9-6678d3e29858}
   +0x0b0 _dwServerPid     : 0x30c
   +0x0b4 _dwServerTid     : 0
   +0x0b8 _iMethodWire     : 3
   +0x0bc _dwErrorBufSize  : 0
   +0x0c0 _pASTAIncomingCallData : (null) 
   +0x0c4 m_ulCancelTimeout : 0xffffffff
   +0x0c8 m_dwStartCount   : 0xfb0049
   +0x0d0 m_pClientCtxCall : (null) 
   +0x0d4 m_pServerCtxCall : (null) 
   +0x0d8 _bCheckedClientIsAppContainer : 0y0
   +0x0d8 _bClientIsAppContainer : 0y0
   +0x0d8 _bServerIsAppContainer : 0y0
   +0x0d8 _bCheckedClientIsRpcss : 0y0
   +0x0d8 _bClientIsRpcss  : 0y0
   +0x0d8 _bCheckedClientIsStrongNamedProcess : 0y0
   +0x0d8 _bClientIsStrongNamedProcess : 0y0
   +0x0d8 _bClientCalledGetBuffer : 0y1
   +0x0d9 _bClientCalledSendOrSendReceive : 0y1
   +0x0d9 _bClientCalledReceiveOrSendReceive : 0y1
   +0x0d9 _bClientReceiveOrSendReceiveCompletedSuccessfully : 0y0
   +0x0d9 _bSuppressClientStopTracing : 0y0
   +0x0dc _uCallTraceId    : 0x23ac7b
   +0x0e0 _hrClient        : 0
   +0x0e4 _sourceOfHResult : 0 ( Unknown )
   +0x0e8 _pInBiasContainer : (null) 
   +0x0ec _pOutBiasContainer : (null) 
   +0x0f0 _pushTlsMitigation : PushTlsPreventRundownMitigation
   +0x0f8 _bEnabledInprocOutParameterMitigation : 0y0
   +0x0f8 _bCheckedConditionsForOutofprocOutParameterMitigation : 0y0
   +0x0f8 _bShouldEnableOutofprocOutParameterMitigation : 0y0
   +0x0f8 _bScheduledAcknowlegmentOfOutParameterMarshalingSet : 0y0
   +0x0fc _pServerOXIDEntry : 0x004fbe68 OXIDEntry

So the key property here is _dwServerPid with a value of 0x30c, which translates to a decimal value of 780. This was a svchost.exe process which can host several windows services. This process was hosting the Windows Management Instrumentation service.

To be honest at this point my energy to dive into that process to see if I could find which spawned WmiPrvSE.exe was at fault had fizzled. There had been alot of other troubleshooting beyond what we have looked at here leading me to believe that there was likely nothing helpful on the WmiPrvSE side that would be useful. The actionable bit of data I could walk away from here is to include a timeout in the wmi-lite query so that if WMI queries hang in the future, the process will at least fail allowing subsequent chef runs to proceed.

Before ending this post, I would like to cover a couple other points that may be helpful in other investigations.

Waiting handles

One dead giveaway from this stack that things are stuck is the WaitForMultipleObjects call at the very top.

0028ea2c 7746112f 00000002 0028ebe0 00000001 ntdll!NtWaitForMultipleObjects+0xc

Just google this name and you will find reams of posts, some well worth the read, talking about hangs. We can even examine the waiting handle. Its the second parameter sent to that call. First we need the handle address that argument points to. We will use dd 0028ebe0 L1 to get the very first dword:

0:000> dd 0028ebe0 L1
0028ebe0  00000260

This gives us the handle address which we retrive using the !handle command:

0:000> !handle 00000260  f
Handle 00000260
  Type          Event
  Attributes    0
  GrantedAccess 0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount   2
  PointerCount  35085
  Name          <none>
  Object specific information
    Event Type Auto Reset
    Event is Waiting

You definitely want to use the f parameter other wise you just get the handle type. So we see here the thread is waiting.

Locks, lock owners and orphans

In another hang I had some dumps of the WmiPrvSE processes on the machine at the time that ruby was hanging. In this dump was the following thread:

   3  Id: 1dd0.293c Suspend: 0 Teb: 7f28f000 Unfrozen
ChildEBP RetAddr  Args to Child              
0385dfa4 7724df87 00000380 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0385e018 7724de0b 00000001 0366e230 0366e260 ntdll!RtlpWaitOnCriticalSection+0xd0
0385e040 7724de35 0385e064 73f33450 0366e260 ntdll!RtlpEnterCriticalSectionContended+0xa0
0385e048 73f33450 0366e260 02e1bb48 0366e230 ntdll!RtlEnterCriticalSection+0x42
0385e064 73f33721 3a369589 00000000 00000000 signdrv!CWhqlObj::WalkTree+0x18
0385e930 73f3364d 02df2da8 02e1bb48 00000000 signdrv!CWhqlObj::CreateList+0x4b
0385e97c 00106d8a 0366e238 02e1b394 00000000 signdrv!CWhqlObj::CreateInstanceEnumAsync+0xdd
0385e9d8 00106a26 00000000 02dc50b4 00000000 WmiPrvSE!CInterceptor_IWbemSyncProvider::Helper_CreateInstanceEnumAsync+0x2c1
0385ea34 75c8676f 02dd8b70 02dc50b4 00000000 WmiPrvSE!CInterceptor_IWbemSyncProvider::CreateInstanceEnumAsync+0xf6
0385ea5c 75d071fc 00106930 0385eca8 00000005 rpcrt4!Invoke+0x34
0385f0e0 74fa868d 02dbd098 02dc4154 02db95b4 rpcrt4!NdrStubCall2+0x32c
0385f12c 75c6147c 02dbd098 02db95b4 02dc4154 combase!CStdStubBuffer_Invoke+0x99 [d:\9147\com\combase\ndr\ndrole\stub.cxx @ 1590]
0385f148 74072f08 02dbd098 02db95b4 02dc4154 rpcrt4!CStdStubBuffer_Invoke+0x2c
0385f164 74fa853f 02dbda6c 02db95b4 02dc4154 fastprox!CBaseStublet::Invoke+0x38
0385f1f0 74e8c086 02dbb228 000f3888 02dc4154 combase!SyncStubInvoke+0x14c [d:\9147\com\combase\dcomrem\channelb.cxx @ 1664]
(Inline) -------- -------- -------- -------- combase!StubInvoke+0x9e [d:\9147\com\combase\dcomrem\channelb.cxx @ 1957]
0385f31c 74fa9600 02dc4154 02db95b4 02dbda6c combase!CCtxComChnl::ContextInvoke+0x236 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 1377]
(Inline) -------- -------- -------- -------- combase!DefaultInvokeInApartment+0xffffffe8 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 2716]
0385f3c4 74fa8de0 02dbda6c 02dd8b70 02de2de0 combase!AppInvoke+0x415 [d:\9147\com\combase\dcomrem\channelb.cxx @ 1481]
0385f520 74fa964f 74fa9710 00000000 02dc0ed0 combase!ComInvokeWithLockAndIPID+0x605 [d:\9147\com\combase\dcomrem\channelb.cxx @ 2311]
0385f57c 75c867cd 02dc0ed0 3c14e0c5 00000000 combase!ThreadInvoke+0x694 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5488]
0385f5c0 75c8722e 74fa9710 02dc0ed0 0385f670 rpcrt4!DispatchToStubInCNoAvrf+0x4d
0385f630 75c86ff7 02dc0ed0 00000000 00000000 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x13e
0385f6c4 75c86d3c 0385f6f4 00000000 02dc0e18 rpcrt4!LRPC_SCALL::DispatchRequest+0x226
0385f720 75c86a14 02de2d48 02de3e50 00000000 rpcrt4!LRPC_SCALL::HandleRequest+0x31c
0385f75c 75c8538f 02de2d48 02de3e50 00000000 rpcrt4!LRPC_SASSOCIATION::HandleRequest+0x1fc
0385f824 75c85101 00000000 75c85070 02d9a0c0 rpcrt4!LRPC_ADDRESS::ProcessIO+0x481
0385f860 77250b10 0385f8e4 02db39c4 02d9a0c0 rpcrt4!LrpcIoComplete+0x8d
0385f898 77250236 0385f8e4 02d9a0c0 00000000 ntdll!TppAlpcpExecuteCallback+0x180
0385fa38 755d7c04 02d981d8 755d7be0 3cb70784 ntdll!TppWorkerThread+0x33c
0385fa4c 7728ad1f 02d981d8 3ee82643 00000000 kernel32!BaseThreadInitThunk+0x24
0385fa94 7728acea ffffffff 7727022d 00000000 ntdll!__RtlUserThreadStart+0x2f
0385faa4 00000000 77244a00 02d981d8 00000000 ntdll!_RtlUserThreadStart+0x1b

This thread jumped out at me because it was waiting to obtain a lock to run a critical section. We can get more info on the critical section by using the !critsec command with the address of the first parameter passed to RtlEnterCriticalSection:

0:000> !critsec 0366e260

CritSec +366e260 at 0366e260
WaiterWoken        No
LockCount          1
RecursionCount     1
OwningThread       2f84
EntryCount         0
ContentionCount    1
*** Locked

We can also use the !locks command:

0:000> !locks

CritSec +366e260 at 0366e260
WaiterWoken        No
LockCount          1
RecursionCount     1
OwningThread       2f84
EntryCount         0
ContentionCount    1
*** Locked

Scanned 13 critical sections

So the kind of obvious next question one might ask themselves now is: what is thread 2f84 doing? Probably nothing productive. However, a listing of threads:

0:000> ~
.  0  Id: 1dd0.9b4 Suspend: 0 Teb: 7f3be000 Unfrozen
   1  Id: 1dd0.2c64 Suspend: 0 Teb: 7f3bb000 Unfrozen
   2  Id: 1dd0.3148 Suspend: 0 Teb: 7f289000 Unfrozen
   3  Id: 1dd0.293c Suspend: 0 Teb: 7f28f000 Unfrozen

shows that thread has gone missing. This probably means one of two things:

  1. We are in the middle of a context switch and owner 2f84 has finished.
  2. 2f84 encountered an access violation and died leaving our thread here orphaned. Tragic but very possible.

Coincidentally, more inspection of this thread shows that it was doing something with our WMI class friend Win32_PNPSignedDriver. This very well could have been why the ruby thread in this case was hanging. It was hard to say because other hangs did not reveal this same pattern.

So I know I've looked at just about enough memory dumps to last me a good while and I hope that this has been equally satisfying to other readers.

Oh...and...hi future self! I'm thinking about you and hoping you a quick investigation.