Understanding and troubleshooting WinRM connection and authentication: a thrill seeker's guide to adventure / October 19, 2015 by Matt Wrock

Connecting to a remote windows machine is often far more difficult than one would have expected. This was my experience years ago when I made my first attempt to use powershell remoting to connect to an Azure VM. At the time, powershell 2 was the hotness and many were talking up its remoting capabilities. I had been using powershell for about a year at the time and thought I'd give it a go. It wasn't simple at all and took a few hours to finally succeed.

Now armed with 2012R2 and more knowledge its simpler but lets say you are trying to connect from a linux box using one of the open source WinRM ports, there are several gotchas.

I started working for Chef about six weeks ago and it is not at all uncommon to find customers and fellow employees struggling with failure to talk to a remote windows node. I'd like to lay out in this post some of the fundamental moving parts as well as the troubleshooting decision tree I often use to figure out where things are wrong and how to get connected.

I'll address cross platform scenarios using plain WinRM, powershell remoting from windows and some Chef specific tooling using the knife-windows gem.

Connecting and Authenticating

In my experience these are the primary hurdles to WinRM sweet success. First is connecting. Can I successfully establish a connection on a WinRM port to the remote machine? There are several things to get in the way here. Then a yak shave or two later you get past connectivity but are not granted access. What's that you say? You are signing in with admin credentials to the box?...I'm sorry say that again?...huh?...I just can't hear you.

TL;DR - A WinRM WTF checklist:

I am going to go into detail in this post on the different gotchas and their accompanying settings needed to successfully connect and execute commands on a remote windows machine using WinRM. However, if you are stuck right now and don't want to sift through all of this, here is a cheat sheet list of things to set to get you out of trouble:

On the remote windows machine:

Run Enable-PSRemoting
Open the firewall with: netsh advfirewall firewall add rule name="WinRM-HTTP" dir=in localport=5985 protocol=TCP action=allow
Accessing via cross platform tools like chef, vagrant, packer, ruby or go? Run these commands:

winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'

Note: DO NOT use the above winrm settings on production nodes. This should be used for tets instances only for troubleshooting WinRM connectivity.

This checklist is likely to address most trouble scenarios when accessing winrm over HTTP. If you are still stuck or want to understand this domain more, please read on.

Barriers to entry

Lets talk about connectivity first. Here are the key issues that can prevent connection attempts to a WinRM endpoint:

The Winrm service is not running on the remote machine
The firewall on the remote machine is refusing connections
A proxy server stands in the way
Improper SSL configuration for HTTPS connections

We'll address each of these scenarios but first...

How can I know if I can connect?

It can often be unclear whether we are fighting a connection or authentication problem. So I'll point out how you can determine if you can eliminate connectivity as a potential issue.

On Mac/Linux:

$ nc -z -w1 <IP or host name> 5985;echo $?

This uses netcat available on the mac and most linux distros. Assuming you are using the default HTTP based WinRM port 5985 (more on determining the correct port in just a bit), if the above returns 0, you know you are getting through to a listening WinRM endpoint on the other side.

On Windows:

Test-WSMan -ComputerName <IP or host name>

Again this assumes you are trying to connect over the default HTTP WinRM port (5985), if not add -UseSSL. This should return some non-error response that looks something like:

wsmid         : http://schemas.dmtf.org/wbem/wsman/identity/1/wsmanidentity.xsd
ProtocolVersion : http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
ProductVendor   : Microsoft Corporation
ProductVersion  : OS: 0.0.0 SP: 0.0 Stack: 3.0

WinRM Ports

The above commands used the default WinRM HTTP port to attempt to connect to the remote WinRM endpoint - 5985. WinRM is a SOAP based HTTP protocol.

Side Note: In 2002, I used to car pool to my job in Sherman Oaks California with my friend Jimmy Bizzaro and would kill time by reading "Programming Web Services with SOAP" an O'Reilly publication. This was cutting edge, cool stuff. Java talking to .net, Java talking to Java but from different machines. This was the future. REST was done in a bed or on a toilet. So always remember, today's GO and Rust could be tomorrow's soap.

Anyhoo...WinRM can talk HTTP and HTTPS. The default ports are 5985 and 5986 respectfully. However the default ports can be changed. Now usually the change is driven by network address translation. Sure these ports can be changed locally too, but in my experience if you need to access WinRM on ports other than 5985 or 5986 its usually to accommodate NAT. So check your Virtualbox NAT config or your Azure or EC2 port mappings to see if there is a port forwarding to 5985/6 on the VM. Those would be the ports you need to use. The Test-WSMan cmdlet also takes a -port parameter where you can provide a non standard WinRM port.

So now you know the port to test but you are getting a non 0 netcat response or an error thrown from Test-WSMan. Now What?

Is WinRM turned on?

This is the first question I ask. If winrm is not listening for requests, then there is nothing to connect to. There are a couple ways to do this. What you usually do NOT want to do is simply start the winrm service. Not that that is a bad thing, its just not likely going to be enough. The two best ways to "turn on" WinRM are:

winrm quickconfig

or the powershell approach:

Enable-PSRemoting

For default windows 2012R2 installs (not altered by group policy), this should be on by default. However windows 2008R2 and client SKUs will be turned off until enabled.

Foiled by Public Network Location

You may get the following error when enabling winrm:

Set-WSManQuickConfig : <f:WSManFault xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault" Code="2150859113"
Machine="localhost"><f:Message><f:ProviderFault provider="Config provider"
path="%systemroot%\system32\WsmSvc.dll"><f:WSManFault xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault"
Code="2150859113" Machine="win81"><f:Message>WinRM firewall exception will not work since one of the network
connection types on this machine is set to Public. Change the network connection type to either Domain or Private and
try again. </f:Message></f:WSManFault></f:ProviderFault></f:Message></f:WSManFault>
At line:1 char:1
+ Set-WSManQuickConfig -Force
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Set-WSManQuickConfig], InvalidOperationException
    + FullyQualifiedErrorId : WsManError,Microsoft.WSMan.Management.SetWSManQuickConfigCommand

You need to set the Network Location to Private. I have written a post devoted to Internet Connection Type. There are different ways to set the location on different windows versions. You can view the details in the above post but the one that is the most obscure but universally works across all versions is:

$networkListManager = [Activator]::CreateInstance([Type]::GetTypeFromCLSID([Guid]"{DCB00C01-570F-4A9B-8D69-199FDBA5723B}")) 
$connections = $networkListManager.GetNetworkConnections() 

# Set network location to Private for all networks 
$connections | % {$_.GetNetwork().SetCategory(1)}

Wall of fire

In some circles called a firewall. This can often be a blocker. For instance, while winrm is on by default on 2012R2, its firewall rules will block public traffic from outside its own subnet. So if you are trying to connect to a server in EC2 or Azure for example, opening this firewall restriction is important and can be done with:

HTTP:

netsh advfirewall firewall add rule name="WinRM-HTTP" dir=in localport=5985 protocol=TCP action=allow

HTTPS:

netsh advfirewall firewall add rule name="WinRM-HTTPS" dir=in localport=5986 protocol=TCP action=allow

This also affects client SKUs which by default do not open the firewall to any public traffic. If you are on a client version of windows 8 or higher, you can also use the -SkipNetworkProfileCheck switch when enabling winrm via Enable-PSRemoting which will at least open public traffic to the local subnet and may be enough if connecting to a machine on a local hypervisor.

Proxy Servers

As already stated, WinRM runs over http. Therefore if you have a proxy server sitting between you and the remote machine you are trying to connect to, you need to make sure that the request goes through that proxy server. This is usually not an issue if you are on a windows machine and using a native windows API like powershell remoting or winrs to connect. They will simply use the proxy settings in your internet settings.

Ruby tooling like Chef, Vagrant, or others uses a different mechanism. If the tool is using the WinRM ruby gem, like chef and vagrant do, they rely on the HTTP_PROXY environment variable instead of the local system's internet settings. As of knife-windows 1.1.0, the http_proxy settings in your knife.rb config file will make its way to the HTTP_PROXY environment variable. You can manually set this as follows:

Mac/Linux:

$ export HTTP_PROXY="http://<proxy server>:<proxy port>/"

Windows Powershell:

$env:HTTP_PROXY="http://<proxy server>:<proxy port>/"

Windows Cmd:

exit

Friends don't let friends use cmd.exe and you are my friend.

SSL

I'm saving SSL for the last connection issue because it is more involved (why folks often opt for HTTP over the more secure HTTPS). There is extra configuration required both on both the remote and local side and that can vary by local platform. Lets first discuss what must be done on the remote WinRM endpoint.

Create a self signed certificate

Assuming you have not purchased a SSL certificate from a valid certificate authority, you will need to generate a self signed certificate. If your are on a 2012R2 windows os version or later, this is trivial:

$c = New-SelfSignedCertificate -DnsName "<IP or host name>" -CertStoreLocation cert:\LocalMachine\My

Read ahead for issues with New-SelfSignedCertificate and certificate verification with openssl libraries.

Creating a HTTPS WinRM listener

Now WinRM needs to be configured to respond to https requests. This is done by adding an https listener and associating it with the thumbprint of the self signed cert you just created.

winrm create winrm/config/Listener?Address=*+Transport=HTTPS "@{Hostname=`"<IP or host name>`";CertificateThumbprint=`"$($c.ThumbPrint)`"}"

Adding firewall rule

Finally enable winrm https requests through the firewall:

netsh advfirewall firewall add rule name="WinRM-HTTPS" dir=in localport=5986 protocol=TCP action=allow

SSL client configuration

At this point you should be able to reach a listening WinRM endpoint on the remote server. On a mac or linux box, a netcat check on the https winrm port should be successful:

$ nc -z -w1 <IP or host name> 5986;echo $?

On Windows, runing Test-NetConnection (a welcome alternative to telnet on windows 8/2012 or higher) should show an open TCP port:

C:\> Test-netConnection <IP> -Port 5986

ComputerName           : <IP>
RemoteAddress          : <IP>
RemotePort             : 5986
InterfaceAlias         : vEthernet (External Virtual Switch)
SourceAddress          : <local IP>
PingSucceeded          : True
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : True

However, trying to establish a WinRM connection will likely fail with a certificate validation error unless you install that same self signed cert you created on the remote endpoint.

If you try to test the connection on windows using Test-WSMan as we saw before, you would receive this error:

Test-WSMan -ComputerName 192.168.1.153 -UseSSL
Test-WSMan : <f:WSManFault
xmlns:f="http://schemas.microsoft.com/wbem/wsman/1/wsmanfault" Code="12175"
Machine="ultrawrock"><f:Message>The server certificate on the destination
computer (192.168.1.153:5986) has the following errors:
The SSL certificate is signed by an unknown certificate authority.
The SSL certificate contains a common name (CN) that does not match the
hostname.     </f:Message></f:WSManFault>
At line:1 char:1
+ Test-WSMan -ComputerName 192.168.1.153 -UseSSL
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (192.168.1.153:String) [Test-W
   SMan], InvalidOperationException
    + FullyQualifiedErrorId : WsManError,Microsoft.WSMan.Management.TestWSManC
   ommand

Now you have a few options depending on your platform and needs:

Do not install the certificate and disable certificate verification (not recommended)
Install to the windows certificate store if you are on windows and need to use native windows APIs like powershell remoting
Export the certificate to a .pem file for use with ruby based tools like chef

Ignoring certificate validation errors

This is equivalent to when you are browsing the internet in a standard browser and try to view a https based site with an invalid cert and the browser gives you a scary warning that you are about to go somewhere potentially dangerous but gives you the option to go there anyway even though that's probably a really bad idea.

If you are testing, especially using a local hypervisor, the risk of a man in the middle attack is pretty small, but you didn't hear that from me. If you do not want to go through the trouble of installing the certificate (we'll go through those steps shortly), here is what you need to do:

Powershell Remoting:

$options=New-PSSessionOption -SkipCACheck -SkipCNCheck
Enter-PSSession -ComputerName <IP or host name> -Credential <user name> -UseSSL -SessionOption $options

WinRM Ruby Gem:

irb
require 'winrm'
w=WinRM::WinRMWebService.new('https://<ip or host>:5986/wsman', :ssl, :user => '<user>', :pass => '<password>', :no_ssl_peer_verification => true)

Chef's knife-windows

knife winrm -m <ip> ipconfig -x <user> -P <password> -t ssl --winrm-ssl-verify-mode verify_none

Installing to the Windows Certificate store

This is the more secure route and will allow you to interact with the machine via powershell remoting without being nagged that your certificate is not valid.

The first thing to do is download the certificate installed on the remote machine:

$webRequest = [Net.WebRequest]::Create("https://<ip or host>:5986/wsman")
try { $webRequest.GetResponse() } catch {}
$cert = $webRequest.ServicePoint.Certificate

Now we have an X509Certificate instance of the certificate used by the remote winrm HTTPS listener. So we install it in our local machine certificate store along with the other root certificates:

$store = New-Object System.Security.Cryptography.X509Certificates.X509Store `
  -ArgumentList  "Root", "LocalMachine"
$store.Open('ReadWrite')
$store.Add($cert)
$store.Close()

Having done this, we can now validate the SSL connection with Test-WSMan:

C:\> Test-WSMan -ComputerName 192.168.43.166 -UseSSL
wsmid        : http://schemas.dmtf.org/wbem/wsman/identity/1/wsmanidentity.xsd
ProtocolVersion : http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd
ProductVendor   : Microsoft Corporation
ProductVersion  : OS: 0.0.0 SP: 0.0 Stack: 3.0

Now we can use tools like powershell remoting or winrs to talk to the remote machine.

Exporting the certificate to a .pem/.cer file

The above certificate store solution works great on windows for windows tools, but it won't help for many cross platform scenarios like connecting from non-windows or using chef tools like knife-windows. The WinRM gem used by tools like Chef and Vagrant take a certificate file which is expected to be a base 64 encoded public key only certificate file. It commonly has a .pem, .cer, or .crt extension.

On windows you can export the X509Certificate we downloaded above to such a file by using the following lines of powershell:

"-----BEGIN CERTIFICATE-----" | Out-File cert.pem -Encoding ascii
[Convert]::ToBase64String($cert.Export('cert'), 'InsertLineBreaks') |
  Out-File .\cert.pem -Append -Encoding ascii
"-----END CERTIFICATE-----" | Out-File cert.pem -Encoding ascii -Append

With this file you could use Chef's knife winrm command from the knife-windows gem to run commands on a windows node:

knife winrm -m 192.168.1.153 ipconfig -x administrator -P Pass@word1 -t ssl -f cert.pem

Problems with New-SelfSignedCertificate and openssl

If the certificate on the server was generated using New-SelfSignedCertificate, cross platform tools that use openssl libraries may fail to verify the certificate unless New-SelfSignedCertificate was used with the -CloneCert argument and passed a certificate that includes a BasicConstraint property identifying it as a CA. Viewing the certificate's properties in the certificate manager GUI, the certificate should contain this:

There are are several alternatives to the convenient New-SelfSignedCertificate cmdlet if you need a cert that must be verified with openssl libraries:

Disable peer verification (not recommended) as shown earlier
Create a private/public key certificate using openssl's req command and then use openssl pkcs12 to combine those 2 files to a pfx file that can be imported to the winrm listener's certificate store. Note: Make sure to include the "Server Authentication" Extended Key Usage (EKU) not added by default
Use the handy New-SelfSignedCertificateEx available from the Technet Script Center and provides finer grained control of the certificate properties and make sure to use the -IsCA argument:

. .\New-SelfSignedCertificateEx.ps1
New-SelfsignedCertificateEx -Subject "CN=$env:computername" `
 -EKU "Server Authentication" -StoreLocation LocalMachine `
 -StoreName My -IsCA $true

Exporting the self signed certificate on non-windows

If you are not on a windows machine, all this powershell is going to produce far different output than what is desirable. However, its actually even simpler to do this with the openssl s_client command:

openssl s_client -connect <ip or host name>:5986 -showcerts </dev/null 2>/dev/null|openssl x509 -outform PEM >mycertfile.pem

The output mycertfile.pem can now be passed to the -f argument of knife-windows commands to execute commands via winrm:

mwrock@ubuwrock:~$ openssl s_client -connect 192.168.1.155:5986 -showcerts </dev/null 2>/dev/null|openssl x509 -outform PEM >mycertfile.pem
mwrock@ubuwrock:~$ knife winrm -m 192.168.1.155 ipconfig -x administrator -P Pass@word1 -t ssl -f ~/mycertfile.pem
WARNING: No knife configuration file found
168.1.155
168.1.155 Windows IP Configuration
168.1.155
168.1.155
168.1.155 Ethernet adapter Ethernet:
168.1.155
168.1.155    Connection-specific DNS Suffix  . :
168.1.155    Link-local IPv6 Address . . . . . : fe80::6c3f:586a:bdc0:5b4c%12
168.1.155    IPv4 Address. . . . . . . . . . . : 192.168.1.155
168.1.155    Subnet Mask . . . . . . . . . . . : 255.255.255.0

Authentication

As you can probably tell so far, alot can go wrong and there are several moving parts to establishing a successful connection with a remote windows machine over WinRM. However, we are not there yet. Most of the gotchas here are when you are using HTTP instead of HTTPS and you are not domain joined. This tends to describe 95% of the dev/test scenarios I come in contact with.

As we saw above, there is quite a bit of ceremony involved in getting SSL just right and running WinRM over HTTPS. Lets be clear: its the right thing to do especially in production. However, you can avoid the ceremony but that just means there is other ceremonial sacrifices to be made. At this point, if you are connecting over HTTPS, authentication is pretty straight forward. If not, there are often additional steps to take. However these additional steps tend to be less friction laden, but more security heinous, than the SSL setup.

HTTP, Basic Authentication and cross-platform

Both the Ruby WinRM gem and the Go winrm package do not interact with the native windows APIs needed to make Negotiate authentication possible and therefore must use Basic Authentication when using the HTTP transport. So unless you are either using native windows WinRM via winrs or powershell remoting or using knife-windows on a windows client (more on this in a bit), you must tweak some of the WinRM settings on the remote windows server to allow plain text basic authentication over HTTP.

Here are the commands to run:

winrm set winrm/config/client/auth '@{Basic="true"}'
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'

One bit of easy guidance here is that if you can't use Negotiate authentication, you really really should be using HTTPS with verifiable certificates. However if you are just trying to get off the ground with local Vagrant boxes and you find yourself in a situation getting WinRM Authentication errors but know you are passing the correct credentials, please try running these on the remote machine before inflicting personal bodily harm.

I always include these commands in windows packer test images because that's what packer and vagrant need to talk to a windows box since they always use HTTP and are cross platform without access to the Negotiate APIs.

This is quite the security hole indeed but usually tempered by the fact that it is on a test box in a NATed network on the local host. Perhaps we are due for a vagrant PR allowing one to pass SSL options in the Vagrantfile. That would be simple to add.

Chef's winrm-s gem using windows negotiate on windows

Chef uses a separate gem that mostly monkey patches the WinRM gem if it sees that winrm is authenticating from windows to windows. In this case it leverages win32 APIs to use Negotiate authentication instead of Basic Authentication and therefore the above winrm settings can be avoided. However, if accessing from a linux client, it will drop to Basic Authentication and the settings shown above must then be present.

Local user accounts

Windows remote communication tends to be easier when you are using domain accounts. This is because domains create implicit trust boundaries so windows adds restrictions when using local accounts. Unfortunately the error messages you can sometimes get do not at all make it clear what you need to do to get past these restrictions. There are two issues with local accounts that I will mention:

Qualifying user names with the "local domain"

One thing that has previously tripped me up and I have seen others struggle with is related to authenticating local users. You may have a local user (not a domain user) and it is getting access denied errors trying to login. However if you prefix the user name with './', then the error is resolved. The './' prefix is equivelent to '<local host or ip>\<user>'. Note that the './' prefix may not work in a windows login dialog box. In that case use the host name or IP address of the remote machine instead of '.'.

Setting the LocalAccountTokenFilterPolicy registry setting

This does not apply to the built in administrator account. So if you only logon as administrator, you will not run into this. However lets say I create a local mwrock account and even add this account to the local Administrators security group. If I try to connect remotely with this account using the default remoting settings on the server, I will get an Access Denied error if using powershell remoting or a WinRMAuthentication error if using the winrm gem. This is typically only visible on 2012R2. By default, the winrm service is running on a newly installed 2012R2 machine with an HTTP listener but without the LocalAccountTokenFilterPolicy enabled, while 2008R2 and client SKUs have no winrm service running at all. Running winrm quickconfig or Enable-PSRemoting on any OS will enable the LocalAccountTokenFilterPolicy, which will allow local accounts to logon. This simply sets the LocalAccountTokenFilterPolicy subkey of HKLM\software\Microsoft\Windows\CurrentVersion\Policies\system to 1.

Trusted Hosts with HTTP, non domain joined powershell remoting

There is an additional security restriction imposed by powershell remoting when connected over HTTP on a non domain joined (work group) environment. You need to add the host name of the machine you are connecting to the list of trusted hosts. This is a white list of hosts you consider ok to talk to. If there are many, you can comma delimit the list. You can also include wildcards for domains and subdomains:

Set-Item "wsman:\localhost\client\trustedhosts" -Value 'mymachine,*.mydomain.com' -Force

Setting your trusted hosts list a single wildcard would allow all hosts:

Set-Item "wsman:\localhost\client\trustedhosts" -Value '*' -Force

You would only do this if you simply interact with local test instances and even that is suspect.

Double-Hop Authentication

Lets say you want to access a UNC share on the box you have connected to or in any other way use your current credentials to access another machine. This will typically fail with the ever informative Access Denied error. You can enable whats called credential delegation by using a different type of authentication mechanism called CredSSP. This is only available using Powershell remoting and requires extra configuration on both the client and remote machines.

On the remote machine, run:

Enable-WSManCredSSP -Role Server

On the client there are a few things to set up. First, similar to the server, you need to enable it but also specify a white list of endpoints. This is formatted similar to the trusted hosts discussed above:

Enable-WSManCredSSP -Role Client -DelegateComputer 'my_trusted_host.me.org'

Next you need to edit the local security policy on the machine to allow delegation to specific endpoints. In the gpedit GUI, navigate to Computer Configuration > Administrative Templates > System > Credential Delegation and enable "Allow Delegating Fresh Credentials". Further, you need to add the endpoints you authorize delegation to. You can add WSMAN\*.my_domain.com to allow all endpoints in the my_domain.com domain. You can add as many entries as you need.

Certificate based authentication

Even more secure than usernames and passwords is using a x509 certificate signed by a trusted certificate authority. Many use this techniue when using SSH with SSH keys. Well, the same is possible with WinRM. I won't get into the details here since I have blogged separately on this topic here.

Windows Nano TP 3

As of the date of this post, Microsoft has released technical preview 3 of its new Windows Nano flavored server OS. I have previously blogged about this super light weight os but here is a winrm related bit of info that is unique to nano as of this version at least: there are no tools to tweak the winrm settings. Neither the winrm command or the winrm powershell provider are present.

In order to make changes, you must edit the registry directly. These settings are located at:

HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN

Other Caveats

I've written an entire post on this topic and will not go into the same detail here. Basically I have found that once winrm is correctly configured, there is still a small subset of operations that will fail in a remote context. Any interaction with wsus is an example but please read my previous post for more. When you hit one of these road blocks, you typically have two options:

Use a Scheduled Task to execute the command in a local context
Install an SSH server and use that

The second option appears to be imminent and in the end will make all of this easier and perhaps render this post irrelevant.

A Packer template for Windows Nano server weighing 300MB / September 28, 2015 by Matt Wrock

Since the dawn of time, human kind has struggled to produce Windows images under a gigabyte and failed. We have all read the stories from the early Upanishads, we have studied the Zoroastrian calculations, recited the talmudic laws governing SxS yet continue to grow ever older as we wait for our windows pets to find an IP for us to RDP to. Well hopefully these days are nearing an end. I think its pretty encouraging that I can now package a windows VM in a 300MB vagrant package.

This post is going to walk through the details and pitfalls of creating a Packer template for Windows Nano Vagrant boxes. I have already posted on the basics of Packer templates and Vagrant box packaging. This post will assume some knowledge of Packer and Vagrant basic concepts.

Windows Nano, a smaller windows

Windows nano finally brings us vm images of similar relative size to its linux cousins. The one I built for VirtualBox is about 307MB. This is 10x smaller than the smallest 2012R2 box I have packaged at around 3GB.

Why so much smaller?

Here are a few highlights:

No GUI. Really this time. No notepad and no cmd.exe window. Its windows without windows.
No SysWow64. Nano completely abandons 32 bit compatibility, but I'm bummed there will be no SysWowWow128.
Minimal packages and features in the base image. The windows team have stripped down this OS to a minimal set of APIs and features. You will likely find some of your go to utilities missing here, but thats ok because it likely has another and probably better API that accomplishes the same functionality.

Basically Microsoft is letting Backwards compatibility slide on this one and producing an OS that does not try to support legacy systems, but is far more effective at managing server cattle.

Installation challenges

Windows Nano does not come packaged in a separate ISO nor does it bundle as a separate image inside the ISO like most of the other server SKUs such as Standard Server or Data Center. Instead you need to build the image from bits in the installation media and extract that.

If you want to host Nano on Hyper-V, running the scripts to build and extract this image are shockingly easy. Even if you want to build a VirtualBox VM, things are not so bad. However there are more moving parts and some elusive gotchas when preparing a Packer template.

Just show me the template

Before I go into detail, mainly as a cathartic act of self governed therapy to recover from the past week of yak shaving, lets just show how to start producing and consuming packer templates for Nano images today. The template can be found here in my packer-templates repository. I'm going to walk through the template and the included scripts but that is optional reading.

I'm running Packer 0.8.2 and Virtualbox 5.0.4 on Windows 8.1 to build the template.

Known Issues

There were several snags but here are a couple items that just didn't work and may trip you up when you first try to build the template or Vagrant up:

I had upgraded to the latest Packer version, 0.8.6 at the time of this post, and had issues with WinRM connectivity so reverted back to 0.8.2. I do plan to investigate that and alter the template to comply with the latest version or file issue(s) and/or PRs if necessary.
Vagrant up will fail but may succeed to the extent that you need it to. It will fail to establish a WinRM connection with the box but it will create a connectable box and can also destroy it. This does mean that you will not have luck using any vagrant provisioners or packer provisioners. For me, that's fine for now.

The reason for the latter issue is that the WinRM service in nano expects requests to use codepage 65001 (UTF-8) and will refuse requests that do not. The WinRM ruby gem used by Vagrant uses codepage 437 and you will see exceptions when it tries to connect. Previous windows versions have accepted both codepages and I have heard that this will be the case with nano by the time it officially ships.

Connecting and interacting with the Nano Server

I have been connecting via powershell remoting. That of coarse assumes you are connecting from Windows. Despite what I said above about the limitations of the ruby WinRM gem, it does have a way to override the 437 codepage. However, doing so is not particularly friendly and means you cannot use alot of the helper methods in the gem.

To connect via powershell, run:

# Enable powershell remoting if it is not already enabled
Enable-PSRemoting -Force

# You may change "*" to the name or IP of the machine you want to connect to
Set-Item "wsman:\localhost\client\trustedhosts" -Value "*" -Force

# the password is vagrant
$creds = Get-Credential vagrant

# this assumes you are using NAT'd network which is the Virtualbox default
# Use the computername or IP of the machine mand skip the port arg
# if you are using Hyper-V or another non NAT network
Enter-PSSession -Computername localhost -Port 55985 -Credential $creds

If you do not have a windows environment from which to run a remote powershell session, you can just create a second VM.

Deploying Nano manually

Before going through the packer template, it would be helpful to understand how one would build a nano server without packer or by hand. Its a bit more involved that giving packer an answer file. There are a few different ways to do this and some paths work better for different scenarios. I'll just layout the procedure for building Nano on virtualbox.

From Windows hosts

Ben Armstrong has a great post on creating nano VMs for Hyper-V. If you are on Windows and want to create Virtualbox VMs, the instructions for creating the nano image are nearly identical. The key change is to specify -OEMDrivers instead of -GuestDrivers in the New-NanoServerImage command. GuestDrivers have the minimal set of drivers needed for Hyper-V. While it can also Create a VirtualBox image that loads and shows the initial nano login screen, I was unable to actually login. Using -OEMDrivers adds a larger set of drivers and allows the box to function in VirtualBox. Its interesting to note that a Hyper-V Vagrant Box build using GuestDrivers is 60MB smaller than one using OEMDrivers.

Here is a script that will pop out a VHD after you mount the Windows Server 2016 Technical Preview 3 ISO:

cd d:\NanoServer
. .\new-nanoserverimage.ps1
mkdir c:\dev\nano
$adminPassword = ConvertTo-SecureString "Pass@word1" -AsPlainText -Force

New-NanoServerImage `
  -MediaPath D:\ `
  -BasePath c:\dev\nano\Base `
  -TargetPath c:\dev\nano\Nano-image `
  -ComputerName Nano `
  -OEMDrivers `
  -ReverseForwarders `
  -AdministratorPassword $adminPassword

Now create a new Virtualbox VM and attach to the VHD created above.

From Mac or Linux hosts

You have no powershell here so the instructions are different. Basically you need to either create or use an existing windows VM. Make sure you have a shared folder setup so that you can easily copy the nano VHD from the windows VM to your host and then create the Virtualbox vm using that VHD as its storage.

That all seems easy, why Packer?

So you may very well be wondering at this point, "Its just a handful of steps to create a nano VM. Your packer template has multiple scripts and probably 100 lines of powershell. What is the advantage of using Packer?"

First there might not be one. If you want to create one instance and play around on the same host and don't care about supporting other instances on other hosts or have scenarios where you need to ensure that multiple nodes come from an identically built image, then packer may not be the right tool for you.

Here are some scenarios where packer shines:

Guaranteed identical images - If all images come from the same template, you know that they are all the same and you have "executable documentation" on how they were produced.
Immutable Infrastructure - If I have production clusters that I routinely tear down and rebuild/replace or a continuous delivery pipeline that involves running tests on ephemeral VMs that are freshly built for each test suite, I can't be futzing around on each node, copying WIMs and VHDs.
Multi-Platform - If I need to create both linux and windows environments, I'd prefer to use a single tool to pump out the base images.
Single click, low friction box sharing - For the thousands and thousands of vagrant users out there, many of whom do not spend much time on windows, giving them a vagrant box is the best way to ensure they have a positive experience provisioning the right image and Packer is the best tool for creating vagrant boxes.

Walking through the template

So now we will step through the key parts of the template and scripts highlighting areas that stray from the practices you would normally see in windows template work and dwelling on nano behavior that may catch you off guard.

High level flow

First a quick summary of what the template does:

Installs Windows Server 2016 Core on a new Virtualbox VM
Powershell script is launched from the answer file that creates the Nano image, mounts it, copies it to an empty partition and then updates the default boot record to boot from that partition.
Machine reboots into nano
Some winrm tweaks are made, the Windows Server 2016 partition is removed and the nano partition extended over it.
"Zap" unused space on disk.
Packer archives the VM to vmdk and packages to a .box file.

Three initial disk partitions

We assume that there is no windows anywhere (because this reflects many build environments) so we will be installing two operating systems: The larger Windows Server 2016 and Nano. We build nano from the former. Our third partition is a system partition. Its easier to have a separate partition for the master boot record that we don't have to touch or move around in the process.

It is important that the Windows Server 2016 Partition be physically located at the end of the disk. Otherwise we will be stuck with a gap in the disk after we remove it.

One may find it odd that our Autounattend.xml file installs Server 2016 from an image named "Windows Server 2012 R2 SERVERDATACENTERCORE." It is odd but correct. That's cool. This is all beta still and I'm sure this is just one detail yet to be ironed out. There is probably some horrendously friction laden process involved to change the image name. One thing that tripped me up a bit is that there are 4 images in the ISO:

C:\dev\test> Dism /Get-ImageInfo /ImageFile:d:\sources\install.wim

Deployment Image Servicing and Management tool
Version: 10.0.10240.16384

Details for image : d:\sources\install.wim

Index : 1
Name : Windows Server 2012 R2 SERVERSTANDARDCORE
Description : Windows Server 2012 R2 SERVERSTANDARDCORE
Size : 9,621,044,487 bytes

Index : 2
Name : Windows Server 2012 R2 SERVERSTANDARD
Description : Windows Server 2012 R2 SERVERSTANDARD
Size : 13,850,658,303 bytes

Index : 3
Name : Windows Server 2012 R2 SERVERDATACENTERCORE
Description : Windows Server 2012 R2 SERVERDATACENTERCORE
Size : 9,586,595,551 bytes

Index : 4
Name : Windows Server 2012 R2 SERVERDATACENTER
Description : Windows Server 2012 R2 SERVERDATACENTER
Size : 13,847,190,006 bytes

The operation completed successfully.

Images 3 and 4, the DataCenter ones are the only ones installable from an answer file.

Building Nano

I think .\scripts\nano_create.ps1 is pretty straight forward. We build the nano image as discussed earlier in this post and copy it to a permanent partition.

What might seem odd is the last few lines that setup winrm. Why do we do this when we are about to blow away this OS and never use winrm? We do this because of the way that the VirtualBox builder works in packer. It is currently waiting for winrm to become available before moving forward in the build process. So this is done simply as a signal to packer. A signal to what?

The Virtualbox builder will now invoke any "provisioners" in the template and then issue the template's shutdown command. We dont use any provisioners which brings us to the our first road bump.

Nano forces a codepage incompatible with packer and vagrant

On the one hand it is good to see Nano using a Utf-8 code page (65001). However, previous versions of Windows have traditionally used the old MS-DOS code page (437) and both the ruby WinRM gem used by Vagrant and the GO WinRM package used by packer are hard coded to use 437. At this time, Nano will not accept 437 so any attempt to establish WinRM communication by Vagrant and Packer will fail with htis error:

An error occurred executing a remote WinRM command.

Shell: powershell
Command: hostname
if ($?) { exit 0 } else { if($LASTEXITCODE) { exit $LASTEXITCODE } else { exit 1 } }
Message: [WSMAN ERROR CODE: 2150859072]: <f:WSManFault Code='2150859072' Machine='192.168.1.130' xmlns:f='http://schemas.microsoft.com/wbem/wsman/1/wsmanfault'><f:Message><f:ProviderFault path='%systemroot%\system32\winrscmd.dll' provider='Shell cmd plugin'>The WinRS client cannot process the request. The server cannot set Code Page. You may want to use the CHCP command to change the client Code Page to 437 and receive the results in English. </f:ProviderFault></f:Message></f:WSManFault>

This means packer provisioners will not work and we need to take a different route to provisioning.

One may think this a show stopper for provisioning Windows images and it is for some scenarios but for my initial packer use case, that's OK and I hear that Nano will accept 437 before it "ships." Note that this only seems to be the case with Nano and not Windows Server 2016.

Cut off from Winrm Configuration APIs

Both Vagrant and Packer expect to communicate over unencrypted WinRM using Basic Authentication. I know I just said that Vagrant and Packer cant talk WinRM at all but I reached a challenge with WinRM before discovering the codepage issue. When trying to allow unencrypted WinRM and basic auth, I found that the two most popular methods for tweaking winrm were not usable on nano.

These methods include:

Using the winrm command line utility
Using the WSMan Powershell provider

The first simply does not exist. Now the winrm command is c:\windows\system32\winrm.cmd which is a tiny thin wrapper around cscript.exe, the scripting engine used to run vbscripts. Well there is no cscript or wscript so no visual basic runtime at all. Interestingly, winrm.vbs does exist. Feels like a sick joke.

So we could use the COM API to do the configuration. If you like COM constants and HRESULTS, this is totally for you. The easier approach at least for my purposes is to simply flip the registry keys to get the settings I want:

REG ADD HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN\Service /v allow_unencrypted /t REG_DWORD /d 1 /f

REG ADD HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN\Service /v auth_basic /t REG_DWORD /d 1 /f

REG ADD HKLM\Software\Microsoft\Windows\CurrentVersion\WSMAN\Client /v auth_basic /t REG_DWORD /d 1 /f

No modules loaded in Powershell scripts run from SetupComplete.cmd

SetupComplete.cmd is a special file that can sit in windows\setup\scripts and if it does, will be run on first boot and then never again. We use this because as mentioned before, we cant use Packer provisioners since winrm is not an option. I have never used this file before so its possible this is not specific to nano but that would be weird. I was wondering why the powershell script I called from this file was not being called at all. Everything seemed to go fine, no errors but my code was definitely not being called. Kinda like debugging scheduled tasks.

First, Start-Transcript is not present on Nano. So that was to blame for the lack of errors. I switched to old school redirection:

cmd.exe /c C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -command . c:\windows\setup\scripts\nano_cleanup.ps1 > c:\windows\setup\scripts\cleanup.txt

Next I started seeing errors about other missing cmdlets like Out-File. I thought that seemed strange and had the script run Get-Module. The result was an empty list of modules so I added loading of the basic PS modules and the storage module, which would normally be auto loaded into my session:

Import-Module C:\windows\system32\windowspowershell\v1.0\Modules\Microsoft.PowerShell.Utility\Microsoft.PowerShell.Utility.psd1
Import-Module C:\windows\system32\windowspowershell\v1.0\Modules\Microsoft.PowerShell.Management\Microsoft.PowerShell.Management.psd1
Import-Module C:\windows\system32\windowspowershell\v1.0\Modules\Storage\Storage.psd1

Not everything you expect is on Nano but likely everything you need

As I mentioned above, Start-Transcipt and cscrpt.exe were missing, but thats not the only things. Here are some other commands I noticed were gone:

diskpart
bcdboot
Get-WMIObject
Restart-Computer

I'm sure there are plenty others but these all have alternatives that I could use.

Different arguments to powershell.exe

A powershell /? will reveal a command syntax slightly different from what one is used to:

C:\dev\test> Enter-PSSession -ComputerName 192.168.1.134 -Credential $c
[192.168.1.134]: PS C:\Users\vagrant\Documents> powershell /?
USAGE: powershell [-Verbose] [-Debug] [-Command] <CommandLine>

  CoreCLR is searched for in the directory that powershell.exe is in,

  then in %windir%\system32\CoreClrPowerShellExt\v1.0\.

No -ExecutionProfile, no -File and others are missing too. I imagine this could break some existing scripts.

No 32 bit

I knew this going in but was still caught off guard when sdelete.exe failed to work. I use sdelete, a sysinternals utility, for freeing empty space on disk which leads to a dramatically smaller image size when we are done. Well I'm guessing it was compiled for 32 bit because I got complaints about the executable image being incomatible with nano.

In the end this turned out to be for the best, I found a pure powershell alternative to sdelete which I adapted for my limited needs:

$FilePath="c:\zero.tmp"
$Volume= Get-Volume -DriveLetter C
$ArraySize= 64kb
$SpaceToLeave= $Volume.Size * 0.05
$FileSize= $Volume.SizeRemaining - $SpacetoLeave
$ZeroArray= new-object byte[]($ArraySize)
 
$Stream= [io.File]::OpenWrite($FilePath)
try {
   $CurFileSize = 0
    while($CurFileSize -lt $FileSize) {
        $Stream.Write($ZeroArray,0, $ZeroArray.Length)
        $CurFileSize +=$ZeroArray.Length
    }
} 
finally {
    if($Stream) {
        $Stream.Close()
    }
}
 
Del $FilePath

Blue Screens of Death

So I finally got the box built and was generally delighted with its size (310MB). However when I launched the vagrant box, the machine blue screened reporting that a critical process had died. All of the above issues had made this a longer haul than I expected but it turned out that troubleshooting the bluescreens was the biggest time suck and sent me on hours of wild goose chases and red herrings. I almost wrote a separate post dedicated to this issue, but I'm gonna try to keep it relatively brief here (not a natural skill).

What was frustrating here is I knew this could work. I had several successful tests but with slightly different execution flows which I was tweaking along the way, but it certainly did not like my final template and scripts. I would get the CRITICAL_PROCESS_DIED blue screen twice and then it would stop at a display of error code 0xc0000225 and the message "a required device isn't connected or can't be accessed."

Based on some searching I thought that there was something wrong somewhere in the boot record. After all I was messing with deleting and resizing partitions and changing the boot record compounded by the fact that I am not an expert in that area. However lots of futzing with diskpoart, cdbedit, cdbboot, and bootrec got me nowhere. I also downloaded the Technical preview 3 debug symbols to analyze the memory dump but there was nothing interesting there. Just a report that the process that died was wininit.exe.

Trying to manually reproduce this I found that the final machine produced by packer was just fine. Packer exports the VM to a new .vmdk virtual disk. Trying to create a machine from that would produce blue screens. Further, manually cloning a .vdi had the same effect - more blue screens. Finally, I tried attaching a new VM to the same disk that worked and made sure the vm settings were identical to the working machine. This failed too which seemed very odd. I then discovered that removing the working machine and manually editing the broken machine 's .box xml to have the same UUID as the working one, fixed things. After more researching, I found out that Virtualbox has a modifiable setting called a Hardware UUID. If none is supplied, it uses the box's UUID. So I cloned another box from the working machine, validated that it blue screened and then ran:

vboxmanage modifyvm --hardwareid "{same uuid as the working box}"

Voila! The box came to life. So I could fix this by telling the packer template to stamp an artificial guid at startup:

    "vboxmanage": [
      [ "modifyvm", "{{.Name}}", "--natpf1", "guest_winrm,tcp,,55985,,5985" ],
      [ "modifyvm", "{{.Name}}", "--memory", "2048" ],
      [ "modifyvm", "{{.Name}}", "--vram", "36" ],
      [ "modifyvm", "{{.Name}}", "--cpus", "2" ],
      [ "modifyvm", "{{.Name}}", "--hardwareuuid", "02f110e7-369a-4bbc-bbe6-6f0b6864ccb6" ]
    ],

then add the exact same guid to the Vagrantfile template:

config.vm.provider "virtualbox" do |vb|
  vb.customize ["modifyvm", :id, "--hardwareuuid", "02f110e7-369a-4bbc-bbe6-6f0b6864ccb6"]
  vb.gui = true
  vb.memory = "1024"
 end

This ensures that vagrant "up"s the box with the same hardware UUID that it was created with. The actual id does not matter and I don't think there is any harm, at least for test purposes, in having duplicate hardware uuids.

I hoped that a similar strategy would work for Hyper-V by changing its BIOSGUID using a powershell script like this:

#Virtual System Management Service
$VSMS = Get-CimInstance -Namespace root/virtualization/v2 -Class Msvm_VirtualSystemManagementService
 
#Virtual Machine
$VM = Get-CimInstance -Namespace root/virtualization/v2  -Class Msvm_ComputerSystem -Filter "ElementName='Demo-VM'"
 
#Setting Data
$SD = $vm | Get-CimAssociatedInstance -ResultClassName Msvm_VirtualSystemSettingData -Association Msvm_SettingsDefineState

#Update bios uuid
$SD.BIOSGUID = "some guid"
 
#Create embedded instance
$cimSerializer = [Microsoft.Management.Infrastructure.Serialization.CimSerializer]::Create()
$serializedInstance = $cimSerializer.Serialize($SD, [Microsoft.Management.Infrastructure.Serialization.InstanceSerializationOptions]::None)
$embeddedInstanceString = [System.Text.Encoding]::Unicode.GetString($serializedInstance)
 
#Modify the system settings
Invoke-CimMethod -CimInstance $VSMS -MethodName ModifySystemSettings @{SystemSettings = $embeddedInstanceString}

Thanks to this post for an example. It did not work but the Hyper-V blue screens seem to be "self healing". I post more details in the vagrant box readme on atlas.

No sysprep

I suspect that the above is somehow connected to OS activation. I saw lots of complaints on google about needing to do something similar with the hardware uuid in order to preserve the windows activation of a cloned machine. I also noted that if I cloned the box manually before packer rebooted into nano thereby letting the clone run the initial setup, things worked.

Ideally, the fix here would be to leave the hardware uuid alone and just sysprep the machine as the final packer step. This is what I do for 2012R2. However from what I can tell, there is no sysprep available for nano. I really hope that there will be.

Finally a version of windows for managing cattle servers

You may have heard the cattle vs. pets analogy that compares "special snowflake" servers to cloud based clusters. The idea is that one perceives cloud instances like cattle. There is no emotional attachment or special treatment given to one server over another. At this scale, one cant afford to. We don't have the time or resources to be accessing our instances via a remote desktop and clicking buttons or dragging windows. If one becomes sick, we put it out of its misery quickly and replace it.

Nano is light weight and made to be accessed remotely. Really interested to see how this progresses.

Troubleshooting Ruby hangs on windows with windbg / September 19, 2015 by Matt Wrock

Ahhh windbg...those who are familiar with it know that you can't live with it and you can't live without it. Unless you are a hardened windows c++ dev (I am not), if you have used windbg, it was in a moment of desperation when all of the familiar tools just wouldn't cut it. So now whenever you hear the word "windbg", it conjours up the archetype of hard to use, arcane, engineering tooling. Step aside git and vi, windbg kicks your ass when it comes to commands that were intended to be forgotten. Yet here's the thing, windbg has saved your but. It allowed you to see things nothing else could show you. Did git or vi provide this value?...ok...maybe they did...but still!!

Many of us have personal windbg stories. These usually involve some horrendous production outage and are mixed with the drama of figuring out a tool that was not meant to just be picked up and mastered. I have several of these stories in my high volume web development days. Here are a couple:

A frequently called function uses a new regex causing CPU to spike accross all web nodes.
Web site becomes unresponsive because the asp.net thread pool is saturated with calls to a third party web service that is experiencing latency issues.

I survived these events and others thanks to windbg. Thanks windbg! But I still hate you and I think I always will.

windbg and ruby?

So why am I talking about windbg and ruby in the same post? Are they not from two different worlds that should remain apart? Well I have now had two separate incidents in the past few months where I have needed to use windbg with a problem in ruby code. Both of these incidents are similar in that they both differ from my previous run ins with windbg where I was debugging managed code and they both were related to a similar Chef related problem.

When it comes to ruby, its all unmanaged code and you may not know the difference between managed and unmanaged code in the windows world and that is totally ok, but if you do, I can tell you there are extensions you can use with windbg (like sos.dll) to make your life easier. Sorry, those do not apply here. Its all native from here on out!

I've blogged long ago about debugging native .net code with windbg, bit one, thats a super old post and two, I had to discover new "go to" commands for dealing with unmanaged memory dumps.

Lets talk like we are 2

As in two years old. First, two year olds are adorable. But the real reason is that's pretty much the level where I operate and whatever windbg maturity I gained over the past few days will be lost the next time I need it. I'm writing this post to my future self who has fewer fast twitching brain cells that I have now and is in a bind and needs a clear explanation of how to navigate in a sea of memory addresses.

So to those who are more seasoned, this may not be for you and you may know a bunch more tricks to make this all simpler and for you god created commenting engines.

The problem

So here is the scenario. You notice that your windows nodes under Chef management are not converging. Maybe, they are supposed to converge every few minutes but they have not converged for hours.

So you figure something has happened to cause my nodes to fail and expect to find chef client logs showing chef run after run with the same failures. Nothing new, you will debug the error and fix it. So you open the log and see the latest chef run has just started a new run. Then you look at the time stamp and notice it started hours ago but never moved past finding its runlist. Its just sitting there hung.

No error message to troubleshoot. There is something wrong but the data stops there. What do you do?

Two words...windbg my friend...windbg. Ok its not two words but kind of has a two word ring to it.

What is this windbg?

Windbg is many things really. At its core, its simply a debugger and many use it to attach to live processes and step through execution. That's not usually a good idea when troubleshooting a multithreaded application like a web site but may not be bad for a chef run. However, I have never used it in this manner.

Another very popular usage is to take a snapshot of a process, also called a memory dump, and use it to deeply examine exactly what was going on in the system at that point in time.

The great thing is that this snapshot is very complete. It has access to all memory in the process, all threads and all stack traces. However the rub is that it is very raw data. Its just raw memory, a bunch of addresses pointers and hex values that may cause more confusion than help.

There are several commands to help sort out the sea of data but its likely far less familiar, intuitive or efficient than your day to day dev tools.

This is one reason why I write this post and why I wrote my last windbg post, I can never remember this stuff and the act of committing my immediate memory to writing and having a permanent record of this learning will help me when I inevitably have another similar problem in the future.

Taking a dump

What?!

Oh...god...no.

Thats really what we call this. Seriously. I sit with straight faced, well paid, professional adults and ask them to take dumps and give them to me.

Oh stop it.

Seriously though, this is the first stumbling point in the debugging endeavor. There are several kinds of memory dumps (crash dumps, hang dumps, minidumps, user mode dumps, kernel dumps, etc), each have their merits and there are different ways to obtain them and some means are more obscure than others.

For debugging ruby hangs, we generally just need a user mode dump of the ruby.exe process. This is not going to be a thorough discussion on all the different types of dumps and the means to produce them but I will cover a couple options.

Task Scheduler

In recent versions of windows, they come equipped with a crash dump generation tool that anyone can easily access. Simply right click on the process you want to examine and then select "create dump file", this generates a user mode dump file of the selected process. There are a couple downsides to collecting dumps in this manner:

1. These dumps do not include the handle table and therefore any use of the !handle command in windbg will fail.

2. On a 64 bit machine, unless you explicitly invoke the 32 bit task scheduler, you will get a 64 bit dump even of 32 bit processes. This is not a big deal really and we'll talk about how to switch to 32 bit mode in a 64 bit dump.

ProcDump

There is a sysinternals tool, ProcDump, that can be used to generate a dump. This tool allows you to set all kinds of thresholds in order to capture a dump file at just the right time based on CPU or memory pressure as well as other factors. You can also simply capture a dump immediately. I typically run:

procdump -ma <PID> <path of dump file>

So in the event of analyzing a ruby process dump, I simply give it the process id of ruby.exe. This will include the handle table and is smart enough to produce a 32 bit dump for 32 bit processes.

There are plenty of other ways to get dumps. I used to use AdPlus.vbs years ago and I *think* that will still work and ships with the debugging tools for windows that also includes windbg. However you get your dump file, it will end in a .dmp extension. If supplying a output path to procdump or other generator, make sure you use that extension when naming the file.

When to take the dump

In this post we are discussing hang scenarios. So when analyzing hangs, take the dump when your process is hanging. If we were analyaing a CPU pinning scenario, we would take the dump when the process was pinning the cpu. This can be easier said than done especially in production environments where hangs and pins may not surface on demand and may take hours or days to occur after a process starts. If you find yourself in such a situation where its tricky to generate the dump file at just the right time, have a look at the various procdump switches that can wait for just the right moment based on several different triggers.

Simulating high memory pressure

In the scenario described here, one of the hang events was preceded by high memory pressure on the chef node as seen from various flavors of OutOfMemoryExceptions on preceding chef runs. To reproduce the scenario, I needed to create an environment where there was little memory available on the box. For the benefit of my future self and others, here is how I did that:

Created a VM with relatively low memory. 512MB was adequate in my case.
Shrunk the paging file down to about 128MB
Used the sysinternals tool, TestLimit to artificially leak memory on the box running: TestLimit64.exe -d 1 -c

At this point based on analyzing previous dumps, I knew exactly where in the chef run the problem was occurring so I just created a little ruby program to run the problem call in a loop until the ruby process hanged.

Getting windbg

Windbg is included in Microsoft's debugging toolset which is included in the Windows SDK. From the microsoft download site, you can download the entire SDK or choose only the debugging tools.

Another option is chocolatey. I maintain the windbg chocolatey package and this package will download and install the windows debugging tools and does a few extra helpful things:

Adds a microsoft symbol server path environment variable that windbg will pick up. Assigning the path symsrv*symsrv.dll*f:\localsymbols*http://msdl.microsoft.com/download/symbols to the _NT_SYMBOL_PATH variable. This is important if you want to be able to make much sense of the dump files. Without symbols, you are left to raw memory addresses with no specific information on the structure and calling patterns of key classes and structures in the call stack.
Copies the sos.dll from your local .net installation to the same folder where windbg is installed. This comes in handy for managed code debugging not covered here.
Adds separate shortcut icons for the 64 bit and 32 bit versions of windbg. It can be important depending on the type of process dumped, to use the corresponding version of windbg. It really does not mater so much for this post.

Simply run:

choco install windbg -y

to install the chocolatey windbg package. I do this on all of my dev machines.

Analyzing dump files

So now you have the tools, windbg, you need and a dump file of a problem process to analyze. The remainder of this post will cover how to make sense of the dump with windbg. I will by no means be covering all the commands possible. In fact I'm just gonna cover a handful which should be all you need but I'd encourage anyone to at least browse over this helpful command summary since there very well may be a command that is better suited for your scenario. I'd also point out that I am NOT a windbg guru and if you see a command or strategy I am missing here that you think would be helpful, please provide a comment.

Loading the dump file

Once windbg is open, navigate to File -> Open Crash Dump and browse to the .DMP file you captured. I know...your process may not have actually "crashed", but this is the correct menu option.

So now your dump is loaded into windbg and the entire memory heap of the process is at your fingertips. Here is the typical path I walk at this point under unmanaged scenarios like a ruby hang.

Which thread do I care about?

Chances are nearly certain that there is more than a single thread captured in the dump and in fact depending on the nature of the app, there can be quite a few. However it is likely you only care about one. At least that was the case in my scenario. I want to examine the thread that is hung.

I'll start by running !runaway. Not because thats the thing I want to do most at this time. Rather this will list all threads and how much user time they have consumed:

0:000> !runaway
 User Mode Time
  Thread       Time
   0:734       0 days 0:20:11.843
   1:94        0 days 0:00:00.078

Here there are only two threads. I am more interested in the one that was running for 20 minutes: thread 0. Now the times displayed here may be lower, perhaps dramatically lower than you expect. My process had been running for a couple hours, what's up with 20 minutes!? Its even possible that your process has been hung for days, and the time displayed here will be a mere milliseconds.

Its important to note that this is "User Time" and is the amount of time that the thread was actually working. To get actual "elapsed time" you can add a parameter to !runaway. I'll run !runaway 7 to get user, kernel and elapsed time:

0:000> !runaway 7
 User Mode Time
  Thread       Time
   0:734       0 days 0:20:11.843
   1:94        0 days 0:00:00.078
 Kernel Mode Time
  Thread       Time
   0:734       0 days 0:13:19.437
   1:94        0 days 0:00:00.031
 Elapsed Time
  Thread       Time
   0:734       0 days 2:58:53.443
   1:94        0 days 2:58:53.412

This is from the dump where I artificially created the high memory pressure environment and the same ruby thread was looping for hours. Of the 3 hours it was looping, the thread itself was only doing work for 20 minutes. Some of the production dumps I looked at had a much higher elapsed time (days) and lower user time (just milliseconds) because Chef had started a ruby process that immediately invoked a blocking call to an external process and then waited for days, during which time it did no work. Lazy bum.

Call stacks

Now lets look at some call stacks. Even though I'm pretty sure at this point that thread 0 is the one I'm interested in, I'll fist browse the stack of all threads using: ~*k

0:000> ~*k

.  0  Id: 90c.734 Suspend: 0 Teb: 7ffdd000 Unfrozen
ChildEBP RetAddr  
0028ea2c 7746112f ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c combase!CCliModalLoop::BlockFn+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 combase!ThreadSendReceive+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d combase!CRpcChannelBuffer::SendReceive2+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d combase!ClassicSTAThreadSendReceive+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 combase!CCtxComChnl::SendReceive+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 combase!NdrExtpProxySendReceive+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f combase!ObjectStublessClient+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 combase!ObjectStubless+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e wbemdisp!CDispatchHelp::Invoke+0xaf
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for win32ole.so - 
0028f7b8 6ab901e2 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde win32ole+0x101e2
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for msvcrt-ruby200.dll - 
0028f988 668e3539 win32ole+0x10bde
0028fa18 668ef3f8 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d msvcrt_ruby200!rb_check_copyable+0x37d1
*** ERROR: Module load completed but symbols could not be loaded for ruby.exe
0028fe88 0040287f msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa ruby+0x287f
0028ff80 77087c04 ruby+0x13fa
0028ff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

   1  Id: 90c.94 Suspend: 0 Teb: 7ffda000 Unfrozen
ChildEBP RetAddr  
022cfec8 77452cc7 ntdll!NtWaitForSingleObject+0xc
022cff3c 77452c02 KERNELBASE!WaitForSingleObjectEx+0x99
022cff50 668fe676 KERNELBASE!WaitForSingleObject+0x12
WARNING: Stack unwind information not available. Following frames may be wrong.
022cff80 77087c04 msvcrt_ruby200!rb_thread_list+0x10d6
022cff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
022cffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
022cffec 00000000 ntdll!_RtlUserThreadStart+0x1b

OK, so the ~*k semantics were probably obvious but remember we are 2. Lets walk through each character:

~ - Typing this alone would have simply listed 2 lines: each thread and their status.

* - This returned all threads. We could have supplied a number instead to get a single thread like ~0k to get just thread 0.

k - this produced the call stack.

There are more single character commands we can use after the 'k' each equally intuitive and discoverable. We will look at a couple in just a bit.

So here we have call stacks for each thread with a line for each frame in the stack. One can kind of groc the gist of the execution here: a bunch of ruby C calls and then an RPC call is sent and it waits for that call to return. The two addresses at the front of each frame are likely not much help for us. They refer to the base pointer of the frame and the return address where execution is to resume.

So now you are dissapointed. You were probably thinking you'd be seeing an actual ruby stack trace. Yeah. That would be cool. So one challenge we face is finding what actual ruby code, perhaps that we wrote, is at play here. However, if you find yourself needing to debug at this level, there are things happening outside of ruby that are of real interest.

Another thing to note is that the ruby frames seem brief compared to the COM frames above. Later we'll see how examining the COM is much easier than ruby. This is because we lack symbols for the compiled ruby runtime. I have not tried it myself, but we could compile ruby ourselves and prooduce symbols then capture a dump running on that ruby. This would likely be helpful but not practical in a production environment.

Switching to the correct bitness

Maybe you are seeing nothing like this at all. If you are seeing something like this:

  0  Id: 90c.734 Suspend: 0 Teb: 00000000`7ffdb000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`0008ee08 00000000`77581e66 wow64cpu!CpupSyscallStub+0x2
00000000`0008ee10 00000000`7754219a wow64cpu!WaitForMultipleObjects32+0x1d
00000000`0008eec0 00000000`775420d2 wow64!RunCpuSimulation+0xa
00000000`0008ef10 00007ffc`7f383a15 wow64!Wow64LdrpInitialize+0x172
00000000`0008f450 00007ffc`7f362f1e ntdll!LdrpInitializeProcess+0x1591
00000000`0008f770 00007ffc`7f2d8ebe ntdll!_LdrpInitialize+0x8a00e
00000000`0008f7e0 00000000`00000000 ntdll!LdrInitializeThunk+0xe

This is what a 32 bit thread looks like in a 64 bit dump. Thats ok. You can switch inside of this same dump to 32 bit mode by running:

!wow64exts.sw

Now run ~0k and you will see a stack like the larger one before.

Going deeper into the stack

So I'll be honest, the ruby frames here make no sense to me. This may be due to the lack of symbols but the last ruby frame would indicate that an argument error is being thrown. However it is called by rb_f_send which does not call rb_error_arity according to the ruby source and rb_error_arity does not call into win32ole. So I don't know whats going on here. Now I do know that the ruby win32ole class manages WMI queries and I also know that wbem is the prefix of the wmi scripting library in windows. Further, WMI queries are handled out of process which would explain the RPC call. In this case, the hang is occuring in a Chef client run but before the cookbook recipes actually run. This is when chef "ohai" data is collected which is the data that describes the current state of a machine and windows nodes leverage WMI for much of that data.

So is there any more info that might lend us some clues here? Well one step we can take is to query the thread 0 stack again and this time try to get the actual parameters being passed into the function calls. Thats done by adding...wait...I wont tell...you guess...

...thats right! Of course...'b'. We add b to the ~0k command so we run ~0kb:

0:000> ~0kb
ChildEBP RetAddr  Args to Child              
0028ea2c 7746112f 00000002 0028ebe0 00000001 ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 00000002 0028ebe0 00000000 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 00000001 0028ec80 ffffffff user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c 0028ec80 00000001 0028ec7c combase!CCliModalLoop::BlockFn+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- -------- -------- -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- -------- -------- -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 0028ef84 005c9c54 00505170 combase!ThreadSendReceive+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- -------- -------- -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d 005c9c54 0028ef84 0028ef4c combase!CRpcChannelBuffer::SendReceive2+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- -------- -------- -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- -------- -------- -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d 005c9c54 0028ef84 0028ef4c combase!ClassicSTAThreadSendReceive+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- -------- -------- -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 005c9c54 0028ef84 0028ef4c combase!CCtxComChnl::SendReceive+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 02b0f2c4 0028efb0 76dc5740 combase!NdrExtpProxySendReceive+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b 29b71661 76f113e0 0028f3d0 rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e 74ba39b0 74ba4ce0 0028f3d0 rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f 0028f3d0 00000003 0028f440 combase!ObjectStublessClient+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 02b0f2c4 005bc634 ffffffff combase!ObjectStubless+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f 005bc690 ffffffff 00000001 fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 02b7b668 0028f4f8 0051197c wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 02b7b668 00000024 00000004 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f 004f54ec 02b7b668 00000001 oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e 02b7b678 00000001 6ab991f4 wbemdisp!CDispatchHelp::Invoke+0xaf
0028f7b8 6ab901e2 02b7b668 00000001 6ab991f4 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde 00000003 00000000 01dc6590 win32ole+0x101e2
0028f988 668e3539 00000001 00433098 02386500 win32ole+0x10bde
0028fa18 668ef3f8 0262556e 02487ce8 00000015 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 0043308c 00433088 00000004 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd 00375750 004b2f30 02419480 msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 01d5af70 00000004 02487ed5 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 00000001 00000000 01dda1b8 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 02522f58 00000000 ffffffff msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d 0028fe7c 00000000 0028fe88 msvcrt_ruby200!rb_check_copyable+0x37d1
0028fe88 0040287f 02522f58 00372cb8 0028ff80 msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa 00000002 00372cb8 00371550 ruby+0x287f
0028ff80 77087c04 7ffde000 77087be0 28433cec ruby+0x13fa
0028ff94 7765ad1f 7ffde000 28241b9c 00000000 kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ffffffff 7764024a 00000000 ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 004014e0 7ffde000 00000000 ntdll!_RtlUserThreadStart+0x1b

This adds three more memory addresses to each frame and they point to the first three arguments passed to the function on the stack.

So these may be hex representations of actual values of pointers to other data structures. If we have symbols for the code being called, things get alot easier but lets look at a ruby call first. I'd like to check out the args passed to the last ruby call in the stack.

Examining raw address values

So the command to display memory in windbg is 'd'. I know, thats really kind of a given but remember we are 2. Also, its not just d. d is always accompanied by another character to let windbg know how to display that value. Here are some I find myself using alot:

dd - DWORD (4bytes) by default it spits out 32 of them but you can limit the number with the L parameter (ex. dd 7746112f L2)
dc - Same output as dd but adds ascii representation to the right
da - displays the address memory as a ascii string
du - displays the memory as a unicode string

There is another one, dt, that is super helpful and we will look at that in just a bit.

So one thing I am curious about here are the arguments passed to the last ruby frame:

0028fa18 668ef3f8 0262556e 02487ce8 00000015 msvcrt_ruby200!rb_error_arity+0x129

Remember, its the 2nd - 4th addresses that contain the first three arguments. You always see 3 addresses regardless of the number of arguments the method takes so these may point to nowhere useful. Looking at the first argument dc 0262556e produces:

0:000> dc 0262556e 
0262556e  336e6977 6e705f32 67697370 6464656e  win32_pnpsignedd
0262557e  65766972 26340072 5f564552 335c3130  river.4&REV_01\3
0262558e  37363226 36313641 26332641 00003833  &267A616A&3&38..
0262559e  3d740241 1b8612c8 43508000 45565c49  A.t=......PCI\VE
026255ae  30385f4e 44263638 375f5645 26323931  N_8086&DEV_7192&
026255be  53425553 305f5359 30303030 26303030  SUBSYS_00000000&
026255ce  5f564552 335c3330 37363226 36313641  REV_03\3&267A616
026255de  26332641 00003030 3d7d0000 1c011331  A&3&00....}=1...

OK, this is interesting and curious. win32_pnpsigneddriver is a valid WMI class. However, this would not be a valid argument for rb_error_arity which takes 3 ints. So now I'm just dismissing the entire ruby stack to lack of debug symbols, but the I feel confident a WMI call is a factor. Here is another clue: the last argument on the second win32ole frame:

0:000> dc 02386500 
02386500  0000000c 02361db0 00000000 6ab834f0  ......6......4.j
02386510  0241c9d8 00000000 09402005 01d5a820  ..A...... @. ...
02386520  00000023 02571888 00000023 00000000  #.....W.#.......
02386530  0000a007 01d528a0 02386560 00000000  .....(..`e8.....
02386540  00000000 00000000 09424005 01d5a820  .........@B. ...
02386550  63657845 72657551 00000079 00000000  ExecQuery.......
02386560  00502005 01d5a820 00000023 02625560  . P. ...#...`Ub.
02386570  0000003e 00000000 08202005 01d5a820  >........  . ...

ExecQuery is the WMI method name that performs queries on WMI classes.

So searching the chef ohai gem I come accross:

drivers = wmi.instances_of('Win32_PnPSignedDriver')

This instances_of method uses the wmi-lite gem calling ExecQuery on the locator returned from:

locator = WIN32OLE.new("WbemScripting.SWbemLocator")

So now I have a pretty good idea what application code is triggering the hang. As I stated earlier in this post, WMI queries create an RPC call to a process named WmiPrvSE.exe. Its not uncommon to see more than one of these processes on a machine. So now I'm wondering if it might be possible to track which of these processes this thread is waiting on. This gets us to a couple other useful windbg commands.

Navigating types

So we do have symbols for the COM code handling the RPC message. That being the case, there is another stack trace command we can use to get us the parameters in a more friendly way - ~0kP

0:000> ~0kP
ChildEBP RetAddr  
0028ea2c 7746112f ntdll!NtWaitForMultipleObjects+0xc
0028ebb8 752dd433 KERNELBASE!WaitForMultipleObjectsEx+0xcc
0028ec1c 76f2fab4 user32!MsgWaitForMultipleObjectsEx+0x163
0028ec54 7702b50c combase!CCliModalLoop::BlockFn(
            void ** ahEvent = 0x0028ec80, 
            unsigned long cEvents = 1, 
            unsigned long * lpdwSignaled = 0x0028ec7c)+0x111 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1571]
(Inline) -------- combase!ModalLoop+0x52 [d:\9147\com\combase\dcomrem\chancont.cxx @ 129]
(Inline) -------- combase!ClassicSTAThreadWaitForCall+0x52 [d:\9147\com\combase\dcomrem\threadtypespecific.cpp @ 172]
0028ecc0 7702a259 combase!ThreadSendReceive(
            class CMessageCall * pCall = 0x00505170, 
            struct _GUID * rmoid = 0x00509fb4 {70622196-bea4-9302-ac70-cabc069d1e23})+0x1d3 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5776]
(Inline) -------- combase!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xd7 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5090]
0028ee0c 76f2fe0d combase!CRpcChannelBuffer::SendReceive2(
            struct tagRPCOLEMESSAGE * pMessage = 0x0028ef84, 
            unsigned long * pstatus = 0x0028ef4c)+0x1e9 [d:\9147\com\combase\dcomrem\channelb.cxx @ 4796]
(Inline) -------- combase!ClientCallRetryContext::SendReceiveWithRetry+0x31 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 1090]
(Inline) -------- combase!CAptRpcChnl::SendReceiveInRetryContext+0x3b [d:\9147\com\combase\dcomrem\callctrl.cxx @ 715]
0028eecc 76f0c65d combase!ClassicSTAThreadSendReceive(
            class CAptRpcChnl * pChannel = 0x005c9c54, 
            struct tagRPCOLEMESSAGE * pMsg = 0x0028ef84, 
            unsigned long * pulStatus = 0x0028ef4c)+0x21d [d:\9147\com\combase\dcomrem\callctrl.cxx @ 696]
(Inline) -------- combase!CAptRpcChnl::SendReceive+0x89 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 752]
0028ef30 7702a010 combase!CCtxComChnl::SendReceive(
            struct tagRPCOLEMESSAGE * pMessage = 0x0028ef84, 
            unsigned long * pulStatus = 0x0028ef4c)+0x105 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 790]
0028ef54 76dc5769 combase!NdrExtpProxySendReceive(
            void * pThis = 0x02b0f2c4, 
            struct _MIDL_STUB_MESSAGE * pStubMsg = 0x0028efb0)+0x5c [d:\9147\com\combase\ndr\ndrole\proxy.cxx @ 2017]
0028ef6c 76e46c1b rpcrt4!NdrpProxySendReceive+0x29
0028f398 77029e1e rpcrt4!NdrClientCall2+0x22b
0028f3b8 76f0c46f combase!ObjectStublessClient(
            void * ParamAddress = 0x0028f3d0, 
            long Method = 0n3)+0x6c [d:\9147\com\combase\ndr\ndrole\i386\stblsclt.cxx @ 215]
0028f3c8 7494a5c6 combase!ObjectStubless(void)+0xf [d:\9147\com\combase\ndr\ndrole\i386\stubless.asm @ 171]
0028f440 74d0627f fastprox!CEnumProxyBuffer::XEnumFacelet::Next+0xd6
0028f484 76cca040 wbemdisp!CSWbemObjectSet::get_Count+0xdf
0028f4a0 76cca357 oleaut32!DispCallFunc+0x16f
0028f748 74cfe81f oleaut32!CTypeInfo2::Invoke+0x2d7
0028f784 74d0876e wbemdisp!CDispatchHelp::Invoke+0xaf
0028f7b8 6ab901e2 wbemdisp!CSWbemDateTime::Invoke+0x3e
WARNING: Stack unwind information not available. Following frames may be wrong.
0028f8b8 6ab90bde win32ole+0x101e2
0028f988 668e3539 win32ole+0x10bde
0028fa18 668ef3f8 msvcrt_ruby200!rb_error_arity+0x129
0028fab8 668ef712 msvcrt_ruby200!rb_f_send+0x518
0028fb38 668e7fdd msvcrt_ruby200!rb_f_send+0x832
0028fc68 668ec1a4 msvcrt_ruby200!rb_vm_localjump_error+0x260d
0028fd78 668f4f71 msvcrt_ruby200!rb_vm_localjump_error+0x67d4
0028fdd8 667c4651 msvcrt_ruby200!rb_iseq_eval_main+0x131
0028fe58 667c6b3d msvcrt_ruby200!rb_check_copyable+0x37d1
0028fe88 0040287f msvcrt_ruby200!ruby_run_node+0x2d
0028feb8 004013fa ruby+0x287f
0028ff80 77087c04 ruby+0x13fa
0028ff94 7765ad1f kernel32!BaseThreadInitThunk+0x24
0028ffdc 7765acea ntdll!__RtlUserThreadStart+0x2f
0028ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

Maybe still not the most beautiful output, but look at those combase calls. We have types and parameter names! Amazing! So to get the process ID of the service handling the call, we want to look at the CMessageCall passed to ThreadSendReceive. We can dig into CMessageCall using the dt command. It takes the symbol to inspect and its address:

0:000> dt CMessageCall 0x00505170
Ambiguous matches found for CMessageCall (dumping largest sized):
    combase!CMessageCall {0x004 bytes}
    combase!CMessageCall {0x100 bytes}
combase!CMessageCall
   +0x000 __VFN_table : 0x76efb4c0 
   =7703e09c s_uCurrentCallTraceId : 0x23ac7b
   +0x004 __VFN_table : 0x76efb53c 
   +0x008 __VFN_table : 0x76efb52c 
   +0x00c _pNextMessage    : (null) 
   =76ef0000 Kind             : 0x905a4d (No matching name)
   +0x010 _callcat         : 1 ( CALLCAT_SYNCHRONOUS )
   +0x014 _iFlags          : 0x202
   +0x018 _hResult         : 0n0
   +0x01c _hEvent          : 0x00000260 Void
   +0x020 _hWndCaller      : (null) 
   +0x024 _ipid            : _GUID {0009042d-030c-0000-5ed7-a7ed59ca66b0}
   +0x034 _hSxsActCtx      : 0xffffffff Void
   +0x038 _server_fault    : 0
   +0x03c _destObj         : CDestObject
   +0x048 _pHeader         : 0x005bfdb8 Void
   +0x04c _pLocalThis      : 0x005bfdd8 WireLocalThis
   +0x050 _pLocalThat      : (null) 
   +0x054 _pHandle         : 0x005941a0 CChannelHandle
   +0x058 _hRpc            : 0x02b68f18 Void
   +0x05c _pContext        : (null) 
   +0x060 _dwCOMHeaderSize : 0x30
   +0x064 message          : tagRPCOLEMESSAGE
   +0x090 _iidWire         : _GUID {423ec01e-2e35-11d2-b604-00104b703efd}
   +0x0a0 _guidCausality   : _GUID {1a786d47-0267-499f-b9e9-6678d3e29858}
   +0x0b0 _dwServerPid     : 0x30c
   +0x0b4 _dwServerTid     : 0
   +0x0b8 _iMethodWire     : 3
   +0x0bc _dwErrorBufSize  : 0
   +0x0c0 _pASTAIncomingCallData : (null) 
   +0x0c4 m_ulCancelTimeout : 0xffffffff
   +0x0c8 m_dwStartCount   : 0xfb0049
   +0x0d0 m_pClientCtxCall : (null) 
   +0x0d4 m_pServerCtxCall : (null) 
   +0x0d8 _bCheckedClientIsAppContainer : 0y0
   +0x0d8 _bClientIsAppContainer : 0y0
   +0x0d8 _bServerIsAppContainer : 0y0
   +0x0d8 _bCheckedClientIsRpcss : 0y0
   +0x0d8 _bClientIsRpcss  : 0y0
   +0x0d8 _bCheckedClientIsStrongNamedProcess : 0y0
   +0x0d8 _bClientIsStrongNamedProcess : 0y0
   +0x0d8 _bClientCalledGetBuffer : 0y1
   +0x0d9 _bClientCalledSendOrSendReceive : 0y1
   +0x0d9 _bClientCalledReceiveOrSendReceive : 0y1
   +0x0d9 _bClientReceiveOrSendReceiveCompletedSuccessfully : 0y0
   +0x0d9 _bSuppressClientStopTracing : 0y0
   +0x0dc _uCallTraceId    : 0x23ac7b
   +0x0e0 _hrClient        : 0
   +0x0e4 _sourceOfHResult : 0 ( Unknown )
   +0x0e8 _pInBiasContainer : (null) 
   +0x0ec _pOutBiasContainer : (null) 
   +0x0f0 _pushTlsMitigation : PushTlsPreventRundownMitigation
   +0x0f8 _bEnabledInprocOutParameterMitigation : 0y0
   +0x0f8 _bCheckedConditionsForOutofprocOutParameterMitigation : 0y0
   +0x0f8 _bShouldEnableOutofprocOutParameterMitigation : 0y0
   +0x0f8 _bScheduledAcknowlegmentOfOutParameterMarshalingSet : 0y0
   +0x0fc _pServerOXIDEntry : 0x004fbe68 OXIDEntry

So the key property here is _dwServerPid with a value of 0x30c, which translates to a decimal value of 780. This was a svchost.exe process which can host several windows services. This process was hosting the Windows Management Instrumentation service.

To be honest at this point my energy to dive into that process to see if I could find which spawned WmiPrvSE.exe was at fault had fizzled. There had been alot of other troubleshooting beyond what we have looked at here leading me to believe that there was likely nothing helpful on the WmiPrvSE side that would be useful. The actionable bit of data I could walk away from here is to include a timeout in the wmi-lite query so that if WMI queries hang in the future, the process will at least fail allowing subsequent chef runs to proceed.

Before ending this post, I would like to cover a couple other points that may be helpful in other investigations.

Waiting handles

One dead giveaway from this stack that things are stuck is the WaitForMultipleObjects call at the very top.

0028ea2c 7746112f 00000002 0028ebe0 00000001 ntdll!NtWaitForMultipleObjects+0xc

Just google this name and you will find reams of posts, some well worth the read, talking about hangs. We can even examine the waiting handle. Its the second parameter sent to that call. First we need the handle address that argument points to. We will use dd 0028ebe0 L1 to get the very first dword:

0:000> dd 0028ebe0 L1
0028ebe0  00000260

This gives us the handle address which we retrive using the !handle command:

0:000> !handle 00000260  f
Handle 00000260
  Type          Event
  Attributes    0
  GrantedAccess 0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount   2
  PointerCount  35085
  Name          <none>
  Object specific information
    Event Type Auto Reset
    Event is Waiting

You definitely want to use the f parameter other wise you just get the handle type. So we see here the thread is waiting.

Locks, lock owners and orphans

In another hang I had some dumps of the WmiPrvSE processes on the machine at the time that ruby was hanging. In this dump was the following thread:

   3  Id: 1dd0.293c Suspend: 0 Teb: 7f28f000 Unfrozen
ChildEBP RetAddr  Args to Child              
0385dfa4 7724df87 00000380 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0385e018 7724de0b 00000001 0366e230 0366e260 ntdll!RtlpWaitOnCriticalSection+0xd0
0385e040 7724de35 0385e064 73f33450 0366e260 ntdll!RtlpEnterCriticalSectionContended+0xa0
0385e048 73f33450 0366e260 02e1bb48 0366e230 ntdll!RtlEnterCriticalSection+0x42
0385e064 73f33721 3a369589 00000000 00000000 signdrv!CWhqlObj::WalkTree+0x18
0385e930 73f3364d 02df2da8 02e1bb48 00000000 signdrv!CWhqlObj::CreateList+0x4b
0385e97c 00106d8a 0366e238 02e1b394 00000000 signdrv!CWhqlObj::CreateInstanceEnumAsync+0xdd
0385e9d8 00106a26 00000000 02dc50b4 00000000 WmiPrvSE!CInterceptor_IWbemSyncProvider::Helper_CreateInstanceEnumAsync+0x2c1
0385ea34 75c8676f 02dd8b70 02dc50b4 00000000 WmiPrvSE!CInterceptor_IWbemSyncProvider::CreateInstanceEnumAsync+0xf6
0385ea5c 75d071fc 00106930 0385eca8 00000005 rpcrt4!Invoke+0x34
0385f0e0 74fa868d 02dbd098 02dc4154 02db95b4 rpcrt4!NdrStubCall2+0x32c
0385f12c 75c6147c 02dbd098 02db95b4 02dc4154 combase!CStdStubBuffer_Invoke+0x99 [d:\9147\com\combase\ndr\ndrole\stub.cxx @ 1590]
0385f148 74072f08 02dbd098 02db95b4 02dc4154 rpcrt4!CStdStubBuffer_Invoke+0x2c
0385f164 74fa853f 02dbda6c 02db95b4 02dc4154 fastprox!CBaseStublet::Invoke+0x38
0385f1f0 74e8c086 02dbb228 000f3888 02dc4154 combase!SyncStubInvoke+0x14c [d:\9147\com\combase\dcomrem\channelb.cxx @ 1664]
(Inline) -------- -------- -------- -------- combase!StubInvoke+0x9e [d:\9147\com\combase\dcomrem\channelb.cxx @ 1957]
0385f31c 74fa9600 02dc4154 02db95b4 02dbda6c combase!CCtxComChnl::ContextInvoke+0x236 [d:\9147\com\combase\dcomrem\ctxchnl.cxx @ 1377]
(Inline) -------- -------- -------- -------- combase!DefaultInvokeInApartment+0xffffffe8 [d:\9147\com\combase\dcomrem\callctrl.cxx @ 2716]
0385f3c4 74fa8de0 02dbda6c 02dd8b70 02de2de0 combase!AppInvoke+0x415 [d:\9147\com\combase\dcomrem\channelb.cxx @ 1481]
0385f520 74fa964f 74fa9710 00000000 02dc0ed0 combase!ComInvokeWithLockAndIPID+0x605 [d:\9147\com\combase\dcomrem\channelb.cxx @ 2311]
0385f57c 75c867cd 02dc0ed0 3c14e0c5 00000000 combase!ThreadInvoke+0x694 [d:\9147\com\combase\dcomrem\channelb.cxx @ 5488]
0385f5c0 75c8722e 74fa9710 02dc0ed0 0385f670 rpcrt4!DispatchToStubInCNoAvrf+0x4d
0385f630 75c86ff7 02dc0ed0 00000000 00000000 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x13e
0385f6c4 75c86d3c 0385f6f4 00000000 02dc0e18 rpcrt4!LRPC_SCALL::DispatchRequest+0x226
0385f720 75c86a14 02de2d48 02de3e50 00000000 rpcrt4!LRPC_SCALL::HandleRequest+0x31c
0385f75c 75c8538f 02de2d48 02de3e50 00000000 rpcrt4!LRPC_SASSOCIATION::HandleRequest+0x1fc
0385f824 75c85101 00000000 75c85070 02d9a0c0 rpcrt4!LRPC_ADDRESS::ProcessIO+0x481
0385f860 77250b10 0385f8e4 02db39c4 02d9a0c0 rpcrt4!LrpcIoComplete+0x8d
0385f898 77250236 0385f8e4 02d9a0c0 00000000 ntdll!TppAlpcpExecuteCallback+0x180
0385fa38 755d7c04 02d981d8 755d7be0 3cb70784 ntdll!TppWorkerThread+0x33c
0385fa4c 7728ad1f 02d981d8 3ee82643 00000000 kernel32!BaseThreadInitThunk+0x24
0385fa94 7728acea ffffffff 7727022d 00000000 ntdll!__RtlUserThreadStart+0x2f
0385faa4 00000000 77244a00 02d981d8 00000000 ntdll!_RtlUserThreadStart+0x1b

This thread jumped out at me because it was waiting to obtain a lock to run a critical section. We can get more info on the critical section by using the !critsec command with the address of the first parameter passed to RtlEnterCriticalSection:

0:000> !critsec 0366e260

CritSec +366e260 at 0366e260
WaiterWoken        No
LockCount          1
RecursionCount     1
OwningThread       2f84
EntryCount         0
ContentionCount    1
*** Locked

We can also use the !locks command:

0:000> !locks

CritSec +366e260 at 0366e260
WaiterWoken        No
LockCount          1
RecursionCount     1
OwningThread       2f84
EntryCount         0
ContentionCount    1
*** Locked

Scanned 13 critical sections

So the kind of obvious next question one might ask themselves now is: what is thread 2f84 doing? Probably nothing productive. However, a listing of threads:

000> ~
.  0  Id: 1dd0.9b4 Suspend: 0 Teb: 7f3be000 Unfrozen
Id: 1dd0.2c64 Suspend: 0 Teb: 7f3bb000 Unfrozen
Id: 1dd0.3148 Suspend: 0 Teb: 7f289000 Unfrozen
Id: 1dd0.293c Suspend: 0 Teb: 7f28f000 Unfrozen

shows that thread has gone missing. This probably means one of two things:

We are in the middle of a context switch and owner 2f84 has finished.
2f84 encountered an access violation and died leaving our thread here orphaned. Tragic but very possible.

Coincidentally, more inspection of this thread shows that it was doing something with our WMI class friend Win32_PNPSignedDriver. This very well could have been why the ruby thread in this case was hanging. It was hard to say because other hangs did not reveal this same pattern.

So I know I've looked at just about enough memory dumps to last me a good while and I hope that this has been equally satisfying to other readers.

Oh...and...hi future self! I'm thinking about you and hoping you a quick investigation.

I'm so excited. I'm about to join Chef and I think I like it! / August 29, 2015 by Matt Wrock

A little over a year ago, a sparkly devops princess named Jennifer Davis came to my workplace at CenturyLink Cloud and taught me a thing or two about this thing she called Chef. I had been heads down for the previous 3 years trying to automate windows. While wandering in the wilderness of windows automation I had previously bumped into Chocolatey where I met my good friend Rob Reynolds who had told me stories about these tools called Puppet, Chef, and Vagrant. Those were the toys that the good little girls and boys on the other side of the mountain (ie linux land) used for their automation. Those tools sounded so awesome but occupied with my day job at Microsoft that was not centered around automation and my nights and weekends developing Boxstarter, I had yet to dabble in these tools myself.

So when the opportunity came to join a cloud computing provider CenturyLink Cloud and focus on data center automation for a mixed windows and linux environment, I jumped at the opportunity. Thats where Jennifer entered the scene and opened a portal door leading from my desert to a land of automation awe and wonder and oh the things I saw. Entire communities built, thriving and fighting all about transforming server configuration from lines of code. To these people, "next" and "finish" were commands and not buttons. These were my people and with them I wanted to hang.

There were windows folk I spied through the portal. Jeffrey Snover had been there for quite some time in full robe and sandals pulling windows folk kicking and screaming through to the other side of the portal. Steven Murauwski was there too translating Jeffrey's teachings and pestering Jeffrey's disciples to write tests because that's how its done. Rob had been sucked through the portal door at least a year prior and was working for Puppet. Hey look he's carrying a Mac just like all the others. Is that some sort of talisman? Is it the shared public key needed to unveil their secrets? I don't know...but its so shiny.

So what the heck, I thought. Tonight we'll put all other things aside and I followed Jennifer through the portal and never looked back. There was much ruby to be learned and written. Its not that different from powershell and had many similarities to C# but it was indeed its own man and would occasionally hit back when I cursed it. One thing I found was that even more fun that using Chef to automate CenturyLink data centers, was hacking on and around the automation tooling like Chef, Vagrant, Ruby's WinRM implementation, Test-Kitchen, etc.

There was so much work to be done here to build a better bridge between windows and the automation these tools had to offer. This was so exciting and Chef seemed to be jumping all in to this effort as if proclaiming I know, I know, I know, I know, I want you, I want you. Before long Steven took up residence with Chef and Snover seemed to be coming over for regular visits where he was working with Chef to feed multitudes on just a couple loaves of DSC. I helped a little here and there along with a bunch of other smart and friendly people.

Well meanwhile on the CenturyLink farm, we made alot of progress and added quite a bit of automation to building out our data centers. Recently things have been winding down for me on that effort and it seemed a good time to asses my progress and examine new possibilities. Then it occured to me, if Jennifer can be a sparkly devops princess at chef, could I be one too?

We may not yet know the answer to that question yet but I intend to explore it and enter into the center sphere of its celestial force.

So I'll see my #cheffriends Monday. I'm so excited and I just cant hide it!

A dirty check for Chef cookbooks: testing everything that has changed and only when it is changed / August 22, 2015 by Matt Wrock

I was recently pinged by a #cheffriend of mine, Alex Vinyar. A few months back, I sent him a bit of code that I wrote at CenturyLink Cloud that runs on our build agents to determine which cookbooks to test after a cookbook commit is pushed to source control. He just mentioned to me that he has gotten some positive feedback on that code so I thought I'd blog on the details of that for others who may be interested. This can also be considered a follow up to my post from almost a year ago on cookbook dependency management where I shared some theory I was pondering at the time on this subject. This post covers how I followed up on that theory with working code.

I'm going to cover two key strategies:

Testing all cookbooks that depend on a changed cookbook
Ensuring that once a cookbook is tested and succeeds, it is never tested again as long as it and its dependencies do not change

I'll share the same code I shared with Alex (and more) in a bit.

Not enough to test just the cookbook changed

Lets consider this scenario: You manage infrastructure for a complex distributed application. You maintain about a baker's dozen cookbooks - see what I did there?...baker...chef...look at me go girl! You have just committed and pushed a change to one of your lower level library cookbooks that exposes functions used by several of your other cookbooks. We'd expect your pipeline to run tests on that library cookbook but is that enough? What if your upstream cookbooks that depend on the library cookbook "blow up" when they converge with the new changes. This is absolutely possible and not just for library cookbooks.

I wanted a way for a cookbook commit to trigger not only its own chefspecs and test-kitchen tests but also the tests of other cookbooks that depend on it.

Kitchen tests provide slow but valuable feedback, don't run more than needed

Here is another scenario: You are now testing the entire dependency tree touched by a change...yay! But failure happens and your build fails. However the failure may not have occurred at the root of your tree (the cookbook that actually changed). Here are some possibilities:

Maybe there was some transient network glitch that caused a upstream cookbook to fail
Gem dependency changes outside of your own managed cookbook collection (you know - the dark of semver - minor build bumps that are really "major")
Your build has batched more than one root cookbook change. One breaks but the others pass

So an hour's worth (or more) of testing 10 cookbooks results in a couple broken kitchen tests. You make some fixes, push again and have to wait for the entire set to build/test all over again.

I have done this before. Just look at my domain name!

However many of these tests may be unnecessary now. Some of the tests that has succeeded in the filed batch may be unaffected by the fix.

Finding dependencies

So here is the class we use at CenturyLink Cloud to determine when one cookbook as been affected by a recent change. This is what I sent to Alex. I'll walk you through some of it later.

require 'berkshelf'
require 'chef'
require 'digest/sha1'
require 'json'

module Promote
  class Cookbook
    def initialize(cookbook_name, config)
      @name = cookbook_name
      @config = config
      @metadata_path = File.join(path, 'metadata.rb')
      @berks_path = File.join(path, "Berksfile")
    end

    # List of all dependent cookbooks including transitive dependencies
    #
    # @return [Hash] a hash keyed on cookbook name containing versions
    def dependencies
      @dependencies ||= get_synced_dependencies_from_lockfile
    end

    # List of cookbooks dependencies explicitly specified in metadata.rb
    #
    # @return [Hash] a hash keyed on cookbook name containing versions
    def metadata_dependencies
      metadata.dependencies
    end

    def version(version=nil)
      metadata.version
    end

    def version=(version)
      version_line = raw_metadata[/^\s*version\s.*$/]
      current_version = version_line[/('|").*("|')/].gsub(/('|")/,"")
      if current_version != version.to_s
        new_version_line = version_line.gsub(current_version, version.to_s)
        new_content = raw_metadata.gsub(version_line, new_version_line)
        save_metadata(new_content)
      end
      version
    end

    # Stamp the metadata with a SHA1 of the last git commit hash in a comment
    #
    # @param [String] SHA1 hash of commit to stamp
    def stamp_commit(commit)
      commit_line = "#sha1 '#{commit}'"
      new_content = raw_metadata.gsub(/#sha1.*$/, commit_line)
      if new_content[/#sha1.*$/].nil?
        new_content += "\n#{commit_line}"
      end
      save_metadata(new_content)
    end

    def raw_metadata
      @raw_metadata ||= File.read(@metadata_path)
    end

    def metadata
      @metadata ||=get_metadata_from_file
    end

    def path
      File.join(@config.cookbook_directory, @name)
    end

    # Performs a berks install if never run, otherwise a berks update
    def sync_berksfile(update=false)
      return if berksfile.nil?

      update = false unless File.exist?("#{@berks_path}.lock")
      update ? berksfile_update : berksfile.install
    end

    # Determine if our dependency tree has changed after a berks update
    #
    # @return [TrueClass,FalseClass]
    def dependencies_changed_after_update?
      old_deps = dependencies
      sync_berksfile(true)
      new_deps = get_dependencies_from_lockfile

      # no need for further inspection is 
      # number of dependencies have changed
      return true unless old_deps.count == new_deps.count

      old_deps.each do |k,v|
        if k != @name
          return true unless new_deps[k] == v
        end
      end

      return false
    end

    # Hash of all dependencies. Guaranteed to change if 
    # dependencies of their versions change
    #
    # @param [environment_cookbook_name] Name of environmnt cookbook
    # @return [String] The generated hash
    def dependency_hash(environment_cookbook_name)
      # sync up with the latest on the server
      sync_latest_app_cookbooks(environment_cookbook_name)
      
      hash_src = ''
      dependencies.each do | k,v |
        hash_src << "#{k}::#{v}::"
      end
      Digest::SHA1.hexdigest(hash_src)
    end

    # sync our environmnt cookbook dependencies 
    # with the latest on the server
    #
    # @return [Hash] the list of cookbooks updated
    def sync_latest_app_cookbooks(environment_cookbook_name)
      result = {}

      #Get a list of all the latest cookbook versions on the server
      latest_server_cookbooks = chef_server_cookbooks(1)
      
      env_cookbook = Cookbook.new(environment_cookbook_name, @config)
      
      # iterate each cookbook in our environment cookbook
      env_cookbook.metadata_dependencies.each do |key|
        cb_key = key[0] # cookbook name

        # skip if its this cookbook or if its not on the server
        next if cb_key == @name || !latest_server_cookbooks.has_key?(cb_key)
        latest_version = latest_server_cookbooks[cb_key]['versions'][0]['version']
        
        # if the server as a later version than this cookbook
        # we do a berks update to true up with the server
        if dependencies.keys.include?(cb_key) &&
          latest_version > dependencies[cb_key].to_s
          
          berksfile_update(cb_key)
          result[cb_key] = latest_version
        end
      end

      # return the updated cookbooks
      result
    end

    private

    def chef_server_cookbooks(versions = 'all')
      Chef::Config[:ssl_verify_mode] = :verify_none
      rest = Chef::REST.new(
        @config.chef_server_url, 
        @config.node_name, 
        @config.client_key
      )
      rest.get_rest("/cookbooks?num_versions=#{versions}")
    end

    def berksfile
      @berksfile ||= get_berksfile
    end

    def get_berksfile
      return unless File.exist?(@berks_path)
      Berkshelf.set_format :null
      Berkshelf::Berksfile.from_file(@berks_path)
    end

    def save_metadata(content)
      File.open(@metadata_path, 'w') do |out|
        out << content
      end

      # force subsequent usages of metadata to refresh from file
      @metadata=nil
      @raw_metadata = nil
    end

    def get_metadata_from_file
      metadata = Chef::Cookbook::Metadata.new
      metadata.from_file(@metadata_path)
      metadata
    end

    def get_synced_dependencies_from_lockfile
      # install to make sure we have a lock file
      berksfile.install

      get_dependencies_from_lockfile
    end

    # get the version of every dependent cookbook from berkshelf
    def get_dependencies_from_lockfile
      berks_deps = berksfile.list
      deps = {}

      berks_deps.each {|dep| deps[dep.name] = dep.locked_version}
      deps
    end

    def berksfile_update(cookbooks = nil)
      cookbooks.nil? ? berksfile.update : berksfile.update(cookbooks)
      @dependencies = nil
    end
  end
end

This class is part of a larger "promote" gem that we use to promote cookbooks along our delivery pipeline. The class represents an individual cookbook and can be instantiated given an individual cookbook name and a config.

The config class (not shown here) is an extremely basic data class that encapsulates the properties of the chef environment.

Some assumptions

This code will only work given a couple conditions are true. One is fairly common and the other not so much. Even if these conditions do not exist in your environment you should be able to groc the logic.

Using Berkshelf

You are using berkshelf to manage your dependencies. This class uses the berkshelf API to find dependencies of other cookbooks.

Uning the Environment Cookbook Pattern

We use the environment cookbook pattern to govern which cookbooks are considered part of a deployable application. You can read more about this pattern here but an environment cookbook is very simple. It is just a metadata.rb and a Bekshelf.lock. The metadata file includes a depends for each of my cookbooks responsible for infrastructure used in the app. The Berkshelf.lock file uses that to build the entire tree of not only "my own" cookbooks but all third party cookbooks as well.

The dependency hash

For the purposes of this post, the dependency_hash method of the above class is the most crucial. This hash represents all cookbook dependencies and their version at the current point in time. If I were to persist this hash (more on that in a bit) and then later regenerate it with a hash that is at all different from the persisted one, I know something has changed. I don't really care where. Any change at all anywhere in a cookbook's dependency tree means it is at risk of breaking and should be tested.

The "Last Known Good" environment

Every time we test a cookbook and it passes, we store its current dependency hash in a special chef environment called LKG (last known good). This is really a synthetic environment with the same structure as any chef environment but the version constraints may look a bit strange:

{
  name: "LKG_platform",
  chef_type: "environment",
  json_class: "Chef::Environment",
  cookbook_versions: {
    p_rabbitmq: "1.1.15.2d88c011bfb691e9de0df593e3f25f1e5bf7cbba",
    p_elasticsearch: "1.1.79.2dd2472fbd47bf32db4ad83be45acb810b788a90",
    p_haproxy: "1.1.47.75dfc55b772fa0eb08b6fbdee02f40849f37ca77",
    p_win: "1.1.121.fa2a465e8b964a1c0752a8b59b4a5c1db9074ee5",
    p_couchbase: "1.1.92.325a02ab22743eccc17756539d3fac57ce5ef952",
    elasticsearch_sidec: "0.1.7.baab30d04181368cd7241483d9d86d2dffe730db",
    p_orchestration: "0.1.48.3de2368d660a4c7dfd6dba2bc2c61beeea337f70"
  }
}

We update this synthetic environment any time a cookbook test runs and passes. Here is a snippet of our Rakefile that does this:

cookbook = Promote::Cookbook.new(cookbook_name, config)
environment = Promote::EnvironmentFile.new("LKG", config)
lkg_versions = environment.cookbook_versions
cookbook_version = "#{cookbook.version.to_s}.#{hash}"

lkg_versions[cookbook_name] = cookbook_version

environment.write_cookbook_versions(lkg_versions)

This is just using the cookbook and environment classes of our Promote gem to write out the new versions and following this, the environment like all of our environments is committed to source control.

The dirty check

We have now seen how we generate a dependency hash and how we persist that on successful tests. Here is where we actually compare our current dependency hash and the last known hash. We need to determine before running integration tests on a given cookbook if it is any different than the last time we succesfully tested it. If we previously tested it and neither it nor any of its dependencies (no matter how deep) have changed, we really do not need or want to test it again.

def is_dirty(
  source_environment,
  lkg_environment,
  cookbook_name,
  dependency_hash = nil
)
  # Get the environment of the last good tests
  lkg_env_file = EnvironmentFile.new(lkg_environment, config)

  # if this cookbook is not in the last good tests, we are dirty
  unless lkg_env_file.cookbook_versions.has_key?(cookbook_name)
    return true
  end

  # get the current environment under test
  source_env_file = EnvironmentFile.new(source_environment, config)
  
  # get the version of this cookbook from the last good tests
  # it will look like 1.1.15.2d88c011bfb691e9de0df593e3f25f1e5bf7cbba
  # thi is the normal version followed by the dependency hash tested
  lkg_parts = lkg_env_file.cookbook_versions[cookbook_name].split('.')

  # Get the "plain" cookbook version
  lkg_v = lkg_parts[0..2].join('.')
  
  # the dependency hash last tested
  lkg_hash = target_parts[3]
  
  # we are dirty if our plain version or dependency hash has changed
  return source_env_file.cookbook_versions[cookbook_name] != lkg_v || 
    dependency_hash != lkg_hash
end

This method takes the environment currently under test, our LKG environment that has all the versions and their dependency hashes last tested, the cookbook to be tested and its current dependency hash. It returns whether or not our cookbook is dirty and needs testing.

Prior to calling this code, our build pipeline edits our "test" environment (the "source_environment" above). This environment holds the cookbook versions of all cookbooks being tested including the last batch of commits. We use a custom kitchen provisioner plugin to ensure Test-Kitchen uses these versions and not just the versions in each cookbooks berkshelf lock file. As I discussed in my dependency post from last year, its important that all changes are tested together.

So there you have it. Now Hurry up and Wait for your dirty cookbooks to run their tests!