Ahhh windbg...those who are familiar with it know that you can't live with it and you can't live without it. Unless you are a hardened windows c++ dev (I am not), if you have used windbg, it was in a moment of desperation when none of the familiar tools would cut it. So now whenever you hear the word "windbg", it conjures up the archetype of hard to use, arcane engineering tooling. Step aside git and vi, windbg kicks your ass when it comes to commands that were intended to be forgotten. Yet here's the thing: windbg has saved your butt. It allowed you to see things nothing else could show you. Did git or vi provide this value?...ok...maybe they did...but still!!
Many of us have personal windbg stories. These usually involve some horrendous production outage and are mixed with the drama of figuring out a tool that was not meant to just be picked up and mastered. I have several of these stories from my high volume web development days. Here are a couple:
- A frequently called function uses a new regex, causing CPU to spike across all web nodes.
- Web site becomes unresponsive because the asp.net thread pool is saturated with calls to a third party web service that is experiencing latency issues.
I survived these events and others thanks to windbg. Thanks windbg! But I still hate you and I think I always will.
windbg and ruby?
So why am I talking about windbg and ruby in the same post? Are they not from two different worlds that should remain apart? Well, I have now had two separate incidents in the past few months where I needed to use windbg on a problem in ruby code. Both incidents differed from my previous run-ins with windbg, where I was debugging managed code, and both stemmed from a similar Chef related problem.
When it comes to ruby, it's all unmanaged code. You may not know the difference between managed and unmanaged code in the windows world, and that is totally ok. If you do, you may know there are extensions you can use with windbg (like sos.dll) to make your life easier when debugging managed code. Sorry, those do not apply here. It's all native from here on out!
I blogged long ago about debugging managed .net code with windbg, but one, that's a super old post and two, I had to discover new "go to" commands for dealing with unmanaged memory dumps.
Let's talk like we are 2
As in two years old. First, two year olds are adorable. But the real reason is that's pretty much the level where I operate, and whatever windbg maturity I gained over the past few days will be lost the next time I need it. I'm writing this post to my future self, who has fewer fast twitching brain cells than I have now, is in a bind, and needs a clear explanation of how to navigate a sea of memory addresses.
So to those who are more seasoned, this may not be for you. You may know a bunch more tricks to make this all simpler, and for you god created commenting engines.
So here is the scenario. You notice that your windows nodes under Chef management are not converging. Maybe they are supposed to converge every few minutes but they have not converged for hours.
So you figure something has happened to cause your nodes to fail, and you expect to find chef client logs showing chef run after run with the same failures. Nothing new, you will debug the error and fix it. So you open the log and see that the latest chef run has just started. Then you look at the time stamp and notice it started hours ago but never moved past finding its runlist. It's just sitting there hung.
No error message to troubleshoot. There is something wrong but the data stops there. What do you do?
Two words...windbg my friend...windbg. Ok, it's not two words but it kind of has a two word ring to it.
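As an aside, the "started hours ago and never moved" symptom above is something you could spot automatically rather than by eyeballing time stamps. Here is a rough ruby sketch of that idea. The log path, the 30 minute threshold and the helper names are all my own assumptions for illustration, not anything Chef ships, and a quiet log can also just mean the client is not scheduled to run.

```ruby
# Sketch only: the log path and the 30 minute threshold are assumptions,
# not Chef defaults you should rely on.
HYPOTHETICAL_LOG = 'C:/chef/log/client.log'
MAX_IDLE_SECONDS = 30 * 60

# True when more than max_idle seconds have passed since last_write.
def stalled?(last_write, now, max_idle = MAX_IDLE_SECONDS)
  (now - last_write) > max_idle
end

# Checks a log file on disk using its last modified time.
# Returns nil when the file does not exist.
def log_stalled?(path, now = Time.now)
  return nil unless File.exist?(path)
  stalled?(File.mtime(path), now)
end

puts 'chef run looks hung' if log_stalled?(HYPOTHETICAL_LOG)
```

This only tells you that something is stuck. It tells you nothing about why, which is where the dump comes in.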
What is this windbg?
Windbg is many things really. At its core, it's simply a debugger, and many use it to attach to live processes and step through execution. That's not usually a good idea when troubleshooting a multithreaded application like a web site but may not be bad for a chef run. However, I have never used it in this manner.
Another very popular usage is to take a snapshot of a process, also called a memory dump, and use it to deeply examine exactly what was going on in the system at that point in time.
The great thing is that this snapshot is very complete. It has access to all memory in the process, all threads and all stack traces. However, the rub is that it is very raw data. It's just raw memory: a bunch of addresses, pointers and hex values that may cause more confusion than help.
There are several commands to help sort out the sea of data, but they are likely far less familiar, intuitive or efficient than your day to day dev tools.
This is one reason why I write this post and why I wrote my last windbg post. I can never remember this stuff, and the act of committing my immediate memory to writing and having a permanent record of this learning will help me when I inevitably have another similar problem in the future.
Taking a dump
That's really what we call this. Seriously. I sit with straight-faced, well paid, professional adults and ask them to take dumps and give them to me.
Oh stop it.
Seriously though, this is the first stumbling point in the debugging endeavor. There are several kinds of memory dumps (crash dumps, hang dumps, minidumps, user mode dumps, kernel dumps, etc). Each has its merits, there are different ways to obtain them, and some means are more obscure than others.
For debugging ruby hangs, we generally just need a user mode dump of the ruby.exe process. This is not going to be a thorough discussion on all the different types of dumps and the means to produce them but I will cover a couple options.
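Whichever way you take the dump, you first need to know which ruby.exe you are after, and there may be more than one. Here is a minimal ruby sketch for listing the candidates. It assumes the CSV output of the windows tasklist command (quoted fields, image name first, PID second); the helper names parse_tasklist and ruby_pids are made up for this post.

```ruby
# Sketch only: parses the CSV that `tasklist /FO CSV /NH` prints on windows.
# Each row is assumed to look like: "ruby.exe","1234","Console","1","45,678 K"
def parse_tasklist(csv_text)
  rows = csv_text.each_line.map do |line|
    fields = line.scan(/"([^"]*)"/).flatten
    # Lines without quoted fields (e.g. the "INFO: No tasks..." message) are skipped.
    next if fields.length < 2
    { name: fields[0], pid: fields[1].to_i }
  end
  rows.compact
end

# Windows only: asks tasklist for every running ruby.exe and returns the PIDs.
def ruby_pids
  output = `tasklist /FI "IMAGENAME eq ruby.exe" /FO CSV /NH`
  parse_tasklist(output).map { |row| row[:pid] }
end

# Example (on a windows node): puts ruby_pids.inspect
```

The parsing is kept separate from the shell call so it can be tried against pasted output on any machine.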
Recent versions of windows come equipped with a dump generation tool that anyone can easily access. Simply right click on the process you want to examine in Task Manager and select "Create dump file". This generates a user mode dump file of the selected process. There are a couple downsides to collecting dumps in this manner:
1. These dumps do not include the handle table and therefore any use of the !handle command in windbg will fail.
2. On a 64 bit machine, unless you explicitly invoke the 32 bit task manager, you will get a 64 bit dump even of 32 bit processes. This is not a big deal really and we'll talk about how to switch to 32 bit mode in a 64 bit dump.
There is a sysinternals tool, ProcDump, that can be used to generate a dump. This tool allows you to set all kinds of thresholds in order to capture a dump file at just the right time based on CPU or memory pressure as well as other factors. You can also simply capture a dump immediately. I typically run: