Configure and test windows infrastructure using Powershell technologies DSC and Pester running from Chef and Test-Kitchen by Matt Wrock

About a week ago I attended the 2014 Chef Summit. I got to meet a bunch of new and interesting people and also met several who I had interacted with online but had never seen in person. One new person I met was Jay Mundrawala (@jdmundrawala). Jay works for chef and built a Test-Kitchen Busser for Pester (as a personal oss contribution and not as part of his job at Chef). You might ask…a What for What? Well this post is going to attempt to answer that and explain why I think it is important.

Pester

Pester is a unit testing framework for Powershell. It was originally created by Scott Muc (@scottmuc) a few years back. I joined in 2012 to add support for Mocking and now development has largely been taken over by Dave Wyatt (@MSH_Dave). It is a BDD style approach to writing and running unit tests for powershell. However, as we will see here, you can write more than just unit tests. You can write a suite of tests to ensure your infrastructure is built and runs as intended.

The whole idea of writing tests for powershell is new to a lot of long time scripters. As just mentioned, this framework has been around for a few years, but it is only now starting to gain popularity in the powershell community, and in fact the Powershell team at Microsoft is beginning to use it themselves.

Many entrenched in the Chef ecosystem have undoubtedly been exposed to rspec and rspec derivative tools for writing tests for their chef recipes and other ruby gems. Pester is very much inspired by rspec and many familiar with rspec who take a first look at Pester may not immediately notice the difference. There are indeed several differences but the primary difference is one is written in and for ruby and the other powershell.

Test-Kitchen

Test kitchen is a tool that is widely used within the Chef community but can also be used by other Configuration management tools like Puppet. Test kitchen is not a test framework per se but it is a sort of meta framework that provides a plugin architecture around configuration management scripts that makes it easy to use one or more of many testing frameworks with your infrastructure management scripts.

There are issues specific to configuration management that make a tool like Test-kitchen very useful. In addition to simply running tests, Test-Kitchen can manage the creation and destruction of a VM or other computing resource where tests can be run in a repeatable, disposable and rebuildable manner. Again, this is managed by another plugin family: drivers. Some may use the vagrant driver, others docker, vsphere, EC2, etc. Using Test-kitchen, I can watch as an instance is provisioned, built, tested and then destroyed without any side effects impacting my local environment.

The plugin that manages different test frameworks is called the busser. This plugin is responsible for “bussing” code from your local machine to a virtual test instance. Jay’s busser, like all the others, simply makes sure that Pester gets installed on the system where you want your tests to run. Since Pester is a powershell based tool, you are typically going to be running Pester tests on a windows machine and the cool thing here is that you can write them in “pure” powershell. No need to wrap all of your powershell inside of ruby language constructs. It's all 100% powershell here.

Enter DSC – Microsoft’s Desired State Configuration

This is an interesting one because it is both a product (or API) of a specific technology vendor and a long time philosophical approach to infrastructure management. Some also incorrectly interpret it as a competitor trying to unseat tools like chef or Puppet. There is indeed some overlap between DSC and other configuration management tools but the easiest way to grok how DSC fits into the CM landscape is as an API for writing resources specifically for windows infrastructure. Chef, Puppet and other tools provide a broad range of features to help you oversee and codify your infrastructure. The DSC surface area is really much simpler. DSC as it stands today consists of a constantly growing set of resources that can be leveraged in your configuration management tool of choice.

What do I mean by “resource?” Resource is a ubiquitous term in the popular CM tools used to provide an abstraction or DSL over a concrete piece of infrastructure (user, group, machine, file, firewall rule, etc). The resource describes how you want this infrastructure to look and does so in code that can be reviewed, tested, linted and source controlled.

You can use straight up DSC to execute these resources, which offers a bare bones approach, or you can wrap them inside of a Chef recipe where they can live alongside non-DSC resources. Now the DSC resources for your windows roles and features, sql server HA and registry keys sit inside of your larger Chef infrastructure of nodes, environments, attributes, etc.

Chef making it easy to execute DSC resources

An initial reaction to this by many would-be users of DSC is, why would I use Chef? Don’t I have to learn Ruby to work with that? Well, because Chef is a full featured, mature configuration management solution, you get access to all of the great reporting and server management features of chef. If you have a mixed windows/linux shop, you can manage everything with chef. Finally, it can be a bit unwieldy using raw DSC on its own. Before you can execute DSC resources, they must be downloaded and installed. Chef makes that super easy. And as we will see with test-kitchen, now you can plug your powershell based tests right into your chef workflow.

A real world example of executing DSC resources with chef and testing with Pester

We are going to follow a typical chef workflow of writing a cookbook to build a server. In our case it will be an IIS powered web server that hosts a Nuget package feed. Nuget is a windows package management specification very similar to ruby Gems. It's also the same specification behind windows Chocolatey packages, which are similar to apt-get/yum/rpm for linux. Our web server will provide a rest based feed similar to rubygems.org that one can use to discover nuget packages.

Welcome to the bleeding edge

Before we get started let me point out that testing cookbooks on windows has not historically been well supported but there is more interest than ever in it today. There is very active development that is driving to make this possible but it is still not available from the latest stable version of Test-Kitchen. During this year’s Chef Summit, this exact topic was discussed. The creator and maintainer of Test-Kitchen, Fletcher Nichol, was present as well as several others either interested in windows support or actively working to provide first class support for windows, like Salim Afiune. I was there as well and I think everyone left with a clear understanding that this work needs to come together in a version of Test-Kitchen in the near future. I blogged on the current state of this tooling just a couple months ago. This may be seen as a continuation of that post with a specific bent towards powershell and DSC.

I will walk you through how to get your environment configured so that you can do this testing today and I will certainly update this post once the tooling is officially released.

Environment setup

I am going to assume that you do not have any of the necessary tools needed to run through the sample cookbook I am about to show. So you can pick and choose what you need to add to your system. I am also assuming you are using the ruby embedded with chefDK. If you have another ruby versioning environment, chances are you know what to do. Note: this environment does not need to be a windows box.

ChefDK

First and foremost you need chef. The easiest way to get chef along with many of the popular tools in its ecosystem like test-kitchen is to install the Chef development kit. There are downloads available for windows, mac and several linux distributions.

Vagrant

This tutorial will use Vagrant to instantiate a machine to run the cookbook and execute the tests. You can download vagrant from VagrantUp and like chef, it has downloads for all of the popular platforms.

A hypervisor

You will need something that your vagrant flavored VM can run in. Many prefer the free and feature complete VirtualBox. If you run on windows and are currently using versions 8/2012 and above, you may use Hyper-V already on your box. Note you cannot run both on the same boot instance.

Git

You will be using git to download some of the tools I am about to mention.

The WinRM Test-Kitchen fork

This will eventually and hopefully soon be merged into the authoritative test-kitchen repo. This fork has been largely developed by Salim Afiune and can be found here. There is still active development here. Currently I have my own fork of this fork working to improve performance of winrm based file transfers. My fork hopes to dramatically improve upload times of cookbooks to the test instance. The cookbook in this tutorial should take just a couple minutes to upload using my fork compared to nearly an hour, and we hope to get the perf much faster than that. Note that WinRM has no equivalent of SCP, so implementing this is a bit crude. Here is how you can install and use my fork:

git clone -b one_session https://github.com/mwrock/test-kitchen
copy-item test-kitchen\lib `
  C:\opscode\chefdk\embedded\apps\test-kitchen `
  -recurse -force
copy-item test-kitchen\support `
  C:\opscode\chefdk\embedded\apps\test-kitchen `
  -recurse -force
cd test-kitchen
C:\opscode\chefdk\embedded\bin\gem build test-kitchen.gemspec
C:\opscode\chefdk\embedded\bin\gem install test-kitchen-1.3.0.gem

The Winrm based Kitchen-Vagrant plugin fork

Salim has also ported his enhancements to the popular Kitchen-Vagrant Test-Kitchen plugin. Since this tutorial uses vagrant, you will need this fork. Note that if you plan to use Hyper-V or a non VirtualBox hypervisor, please use my fork that includes recent changes to make vagrant and the winrm test kitchen work outside of VirtualBox. Here is how to get and install this:

git clone -b Transport https://github.com/mwrock/kitchen-vagrant
cd kitchen-vagrant
C:\opscode\chefdk\embedded\bin\gem build kitchen-vagrant.gemspec
C:\opscode\chefdk\embedded\bin\gem install kitchen-vagrant-0.16.0.gem

The dsc_nugetserver repository containing a sample cookbook and pester tests

This can simply be cloned from https://github.com/mwrock/dsc_nugetserver.

DSC in a chef recipe

Similar to the WinRM Test-Kitchen work, the DSC recipe work done by the folks at chef is still in fairly early development. There is a dsc_script resource available in the latest chef client release as of this post. There is also a community cookbook that represents a prototype of work that will be evolved into the core chef client. This cookbook contains the dsc_resource resource.

I intentionally wrote the dsc_nugetserver cookbook almost entirely from DSC resources. Let's take a look in the default recipe and observe the two flavors of the dsc resource.

dsc_script

dsc_script  "webroot" do
  code <<-EOH
    File webroot
    {
      DestinationPath="C:\\web"
      Type="Directory"
    }
  EOH
end

This is what is currently supported by the official chef-client and ships with the latest version. It really just wraps the DSC Configuration syntax supported by powershell today. The benefit you get from using it inside of a chef recipe is that you can now use dsc_script as just another resource in your wider library of cookbooks. Chef also does some leg work for you. You do not need to worry about where the resource is installed and you do not need to compile the resource before use.

dsc_resource

dsc_resource "http firewall rule" do
  resource_name :xfirewall
  property :name, "http"
  property :ensure, "Present"
  property :state, "Enabled"
  property :direction, "Inbound"
  property :access, "Allow"
  property :protocol, "TCP"
  property :localport, "80"
end

This is really similar to, if not the same as, dsc_script, just with different syntax. Note the use of the property DSL. dsc_resource also does a much better job at finding the correct resource. While I believe that dsc_script only works with the official microsoft preinstalled resources, the community dsc cookbook can locate the newer experimental resources that are being distributed as part of the community resource kit waves.

Using the resource_kit recipe to download and install all of the current resource wave kit modules

I have included a recipe that will download the latest batch of resource wave dsc resources. I basically just copied this from one of chef’s own cookbook examples and replaced the download url with the latest resource wave. Once this recipe runs, literally all dsc resources are available for you to use.

Whatif Bug affecting most resources used within chef

There is a bug in both of the dsc resource flavors that will cause most resources to crash. If the dsc resource either does not support ShouldProcess or if the underlying call to powershell DSC’s Set-TargetResource results in the function throwing an error, these chef resources currently do not fail gracefully. So as is, the resource will break when called. The chef team knows about this and has a fix that will be released in a future release.

In the meantime, I have forked the community dsc_resource in the dsc cookbook and commented out a single line. I can consume this fork from any cookbook by adding this to the Berksfile:

source "https://supermarket.getchef.com"

metadata

cookbook 'dsc' , git: 'https://github.com/mwrock/dsc'

Converging the recipe

The sample cookbook comes with a .kitchen.yml file that includes a pointer to an evaluation copy of windows 8.1 for testing. I would have included a 2012 box instead but my 2012 vagrant box is Hyper-V only and I have not had time to add VirtualBox support.

So running:

kitchen converge

Should create a windows box for testing and converge that box to run the sample recipe.

[2014-10-13T02:34:38-07:00] INFO: Getting PowerShell DSC resource 'xfirewall'
[2014-10-13T02:35:26-07:00] INFO: DSC Resource type 'xfirewall' Configuration completed successfully
[2014-10-13T02:35:29-07:00] INFO: Chef Run complete in 534.665725 seconds
[2014-10-13T02:35:29-07:00] INFO: Removing cookbooks/dsc_nugetserver/files/default/NugetServer.zip from the cache; it is no longer needed by chef-client.
[2014-10-13T02:35:29-07:00] INFO: Running report handlers
[2014-10-13T02:35:29-07:00] INFO: Report handlers complete
Finished converging <default-windows-81> (13m6.61s).
-----> Kitchen is finished. (13m11.87s)
C:\dev\dsc_nugetserver [master]>

Note that there is a chance the kitchen converge will fail shortly after creating the box and just before downloading the chef client. My suspicion is that this is because the windows 8.1 box is hard at work installing updates and the initial winrm call times out. I have always had success immediately calling kitchen converge again.

So once this completes, you should be able to open a local browser and point at your test box to see the nuget server informational home page:

Testing the recipe with Pester

Here are the tests we will run with Pester:

describe "default recipe" {

  it "should expose a nuget packages feed" {
    $packages = Invoke-RestMethod -Uri "http://localhost/nuget/Packages"
    $packages.Count | should not be 0
    $packages[0].Title.InnerText | should be 'elmah'
  }

  context "firewall" {

    $rule = Get-NetFirewallRule | ? { $_.InstanceID -eq 'http' }
    $filter = Get-NetFirewallPortFilter | ? { $_.InstanceID -eq 'http' }

    it "should filter port 80" {
      $filter.LocalPort | should be 80
    }
    it "should be enabled" {
      $rule.Enabled | should be $true
    }
    it "should allow traffic" {
      $rule.Action | should be "Allow"
    }
    it "should apply to inbound traffic" {
      $rule.Direction | should be "Inbound"
    }    
  }
}

This is 100% powershell. No ruby to see here.

This is first going to test our nuget server website. If all went as we intended, an http call to the root of localhost should reach our nuget server and it should behave like a nuget feed. So here we expect the Packages feed to return some packages and knowing what the first package should be, we test that its name is what we expect.

Because Test-Kitchen runs tests on the converged node itself, a successful call to localhost does not prove that the outside world can reach our entry point. So we go ahead and test that we opened the firewall correctly.

kitchen verify

The kitchen Pester busser now installs Pester:

C:\dev\dsc_nugetserver [master]> kitchen verify
-----> Starting Kitchen (v1.3.0)
-----> Setting up <default-windows-81>...
       Successfully installed thor-0.19.0
       Successfully installed busser-0.6.2
       2 gems installed
       Plugin pester installed (version 0.0.6)
-----> Running postinstall for pester plugin
-----> [pester] Installing PsGet
Downloading        PsGet from https://github.com/psget/psget/raw/master/PsGet/PsGet.psm1
PsGet is installed and ready to use
       USAGE:
           PS> import-module PsGet
           PS> install-module PsUrl

       For more details:
           get-help install-module
       Or visit http://psget.net
-----> [pester] Installing Pester

Then it runs our tests:

-----> Running pester test suite
-----> [pester] Running
Executing all tests in 'C:\tmp\busser\suites\pester'
Describing default recipe
[+] should expose a nuget packages feed 4.02s
   Context firewall
[+] should filter port 80 3.18s
[+] should be enabled 16ms
[+] should allow traffic 12ms
[+] should apply to inbound traffic 13ms
Tests completed in 7.23s
       Passed: 5 Failed: 0
       Finished verifying <default-windows-81> (0m22.55s).
-----> Kitchen is finished. (0m59.74s)
C:\dev\dsc_nugetserver [master]>

Bugs regarding Execution Policy

One issue I ran into with both the dsc_resource resource and the Pester busser was a failure to bypass the ExecutionPolicy of the Powershell.exe process. If no one has explicitly set an execution policy on the box, which is the case on a newly provisioned machine, the converge will fail complaining that the execution of scripts is not allowed. The exception is windows server 2012R2, which implements a new default ExecutionPolicy of RemoteSigned instead of Undefined. Since the test vagrant box used here is windows 8.1, it is susceptible to this bug.

You can work around this by setting the execution policy in the recipe as is done in the sample:

powershell_script "set execution policy" do
  code <<-EOH
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Force
    if(Test-Path "$env:SystemRoot\\SysWOW64") {
      Start-Process "$env:SystemRoot\\SysWOW64\\WindowsPowerShell\\v1.0\\powershell.exe" -verb runas -wait -argumentList "-noprofile -WindowStyle hidden -noninteractive -ExecutionPolicy bypass -Command `"Set-ExecutionPolicy RemoteSigned`""
    }
    EOH
end

We set the policy for both the 64 and 32 bit shells since chef-client is a 32 bit process.

I have filed an issue with the dsc cookbook here and submitted a pull request for the busser here.

Testing on windows keeps getting better

We are still not where we need to be but we are making progress. I think this is a big step towards making the work coming out of the Microsoft DSC initiatives more accessible. Here you have all the tools you need to not only execute DSC resources but also test them. A big thanks to Jay and Salim for their work here with Test Kitchen and the Pester busser!

If you want to learn more about DSC or Chef, and particularly DSC in Chef, pay attention to Steven Murawski's blog. Steven is a Chef Community manager and has done a ton of work with DSC at his previous employer Stack Exchange, the home of StackOverflow.com.

Chef Cookbook dependency management and the environment cookbook pattern by Matt Wrock

Last week I discussed how we at CenturyLink Cloud are approaching the versioning of cookbooks, environments and data bags focusing on our strategy of generating version numbers using git commit counts. In this post I’d like to explore one of the concrete values these version numbers provide in your build pipeline. We’ll explore how these versions can eliminate surprise breaks when cookbook dependencies in production may be different from the cookbooks you used in your tests. You are testing your cookbooks right?

A cookbook is its own code plus the sum of its dependencies

My build pipeline must be able to guarantee to the furthest extent possible that my build conditions match the same conditions of production. So if I test a cookbook with a certain set of dependencies and the tests pass, I want to have a high level of confidence that this same cookbook will converge successfully on production environments. However, if I promote the cookbook to production but my production environment has different versions of the dependent cookbooks, this confidence is lost because my tests accounted for different circumstances.

Even if these different versions are just at the patch level (the third level version number in the semantic version numbering schema), I still consider this an important difference. I know that only changes of the major version number should include breaking changes but let's just see a show of hands of those who have deployed a bug fix patch that regressed elsewhere…I thought so. You can put your hands down now everyone. Regardless, it can be quite simple to allow major version changes to slip in if your build process does not account for this.

Common dependency breakage scenarios

We will assume that you are using Berkshelf to manage dependencies and that you keep your Berksfile.lock files synced and version controlled. Let's explore why this is not enough to protect from dependency change creep.

Relaxed version constraints

Even if you have rigorously ensured that your metadata.rb files include explicit version constraints, there is never any guarantee that every downstream community cookbook has specified constraints in its own metadata.rb file. You might think this is exactly where the Berksfile.lock saves you. A Berksfile.lock will snap the entire dependency graph to the exact version of the last sync. If this lock file is in source control, you can be assured that fellow teammates and your build system are testing the top level cookbook with all the same dependency versions. So far so good.

Now you upload the cookbooks to the chef server and when it's time for a node to converge against your cookbook changes, where is your Berksfile.lock now? Unless you have something in place to account for this, chef-client is simply going to get the highest available version for cookbook dependencies without constraints. If anyone at any time uploads a higher version of a community cookbook to the chef server that is used by other cookbooks that had been tested against lesser versions, a break can easily occur.
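To make the "highest available version wins" behavior concrete, here is a minimal sketch. It uses the Gem::Version and Gem::Requirement classes that ship with Ruby as a stand-in for the chef server's own resolution logic, which is an assumption made purely for illustration. The resolve helper and the version list are hypothetical.

```ruby
require 'rubygems' # Gem::Version and Gem::Requirement ship with Ruby

# Hypothetical helper: pick the highest uploaded version that satisfies a
# constraint. With no constraint given, any version is eligible.
def resolve(available, constraint = '>= 0')
  requirement = Gem::Requirement.new(constraint)
  available.map { |v| Gem::Version.new(v) }
           .select { |v| requirement.satisfied_by?(v) }
           .max
           .to_s
end

# Versions of one community cookbook that have been uploaded over time.
uploaded = ['1.2.0', '1.2.4', '2.0.0']

puts resolve(uploaded)            # no constraint: the newest upload is used
puts resolve(uploaded, '= 1.2.0') # exact pin: the tested version is used
```

Without a constraint the sketch lands on 2.0.0, a major version nobody tested against. With the exact pin it stays on 1.2.0 no matter what gets uploaded later.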

Dependency islands in the same environment

This is related to the scenario just described above and explains how cookbook dependencies can be uploaded from one successful cookbook test run that can break sibling cookbooks in the same environment.

The Berksfile.lock file captures the initial dependency graph upon the very first berks install. Therefore you can create two cookbooks that have all of the same cookbook dependencies but if you build the Berksfile.lock files even hours apart, there is a perfectly reasonable possibility that the two cookbooks will have different versioned dependencies in their respective Berksfile.lock files.

Once both sets are uploaded to the chef server and unless an explicit version constraint is specified, the highest eligible version wins and this may be a different version than some of the cookbooks that use this dependency were tested against. So now you can only hope that everything works when nodes start converging.

Poorly timed application of cookbook constraints

You may be thinking the obvious remedy to all of these dependency issues is to add cookbook constraints to your environment files. I think you are definitely on the right track. This will eliminate the mystery version creep scenarios and you can look at your environment file and know exactly what versions of which cookbooks will be used. However in order for this to work, it has to be carefully managed. If I promote a cookbook version along with all of its dependencies by updating the version constraints in my environment, can I be guaranteed that all other cookbooks in the environment with the same downstream dependencies have been tested with any new versions being updated?

I do believe that constraining environment versions is important. These constraints can serve the same function as your Berksfile.lock file within any chef server environment. However, unless the entire matrix of constraints matches up against what has been tested, these constraints provide inadequate safety.

Safety guidelines for cookbook constraints in an environment

A constraint for every cookbook

Not only your internal top level cookbooks should be constrained. All dependent cookbooks should also include constraints. Any cookbook missing a constraint introduces the possibility that an untested dependency graph will be introduced.
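A check like this is easy to automate. The following is a minimal sketch, assuming the environment file is JSON with a cookbook_versions hash (as in the code later in this post) and that you already have the list of cookbook names in your dependency graph. The unconstrained_cookbooks helper and the sample data are hypothetical.

```ruby
require 'json'

# Hypothetical helper: names in the dependency graph that the environment
# file does not constrain at all.
def unconstrained_cookbooks(graph, environment)
  constrained = environment.fetch('cookbook_versions', {})
  graph.reject { |name| constrained.key?(name) }
end

# Sample environment content and dependency graph, made up for illustration.
environment = JSON.parse(
  '{"name": "test", "cookbook_versions": {"web": "1.0.12", "iis": "2.1.3"}}'
)
graph = %w[web iis windows]

p unconstrained_cookbooks(graph, environment) # "windows" has no constraint
```

Any name this returns is a cookbook whose version can float, so a build step could fail whenever the list is not empty.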

All constraints point to exact versions (no pessimistic constraints)

I do think pessimistic constraints are fine in your Berksfile or metadata.rb files. This basically says you are ok with upgrading dependencies within dev/test scenarios, but once you need to establish a baseline of a known set of good cookbooks, you want that known group to be declared rigidly. Unless you point to precise versions, you are stating that you are ok with “inflight” change and that’s the change that can bring down your system.
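The difference between the two constraint styles can be sketched in a few lines. This example borrows Gem::Requirement from RubyGems, which understands the same ~> and = operators, as a stand-in for chef's own constraint handling. The allowed_versions helper and the version list are hypothetical.

```ruby
require 'rubygems' # Gem::Version and Gem::Requirement ship with Ruby

# Hypothetical helper: which of the given versions does a constraint allow?
def allowed_versions(constraint, versions)
  requirement = Gem::Requirement.new(constraint)
  versions.select { |v| requirement.satisfied_by?(Gem::Version.new(v)) }
end

# Versions of a dependency that exist on the server, made up for illustration.
uploaded = %w[1.2.3 1.2.9 1.3.0]

p allowed_versions('~> 1.2.0', uploaded) # pessimistic: any 1.2.x patch release
p allowed_versions('= 1.2.3', uploaded)  # exact: only the version that was tested
```

The pessimistic constraint lets 1.2.9 through even though only 1.2.3 was tested, which is exactly the "inflight" change an exact constraint rules out.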

Test all changes and all cookbooks potentially affected by changes

You will need to be able to identify which cookbooks had no direct changes applied but are part of the same graph undergoing change. In other words if both cookbook A and B depend on C and C gets bumped as you are developing A, you must be able to automatically identify that B is potentially affected and must be tested to validate your changes to A even though no direct changes were made to B. Not until A, B, and C all converge successfully can you consider the changes to be included in your environment.
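One way to find the potentially affected cookbooks is to walk the dependency graph in reverse from whatever changed. Here is a minimal sketch of that idea using a hand-written dependency map. It is not how our pipeline does it (that relies on version bumps, described below), and the affected_cookbooks helper and the cookbook names are hypothetical.

```ruby
require 'set'

# Hypothetical dependency map: cookbook name => its direct dependencies.
DEPENDENCIES = {
  'A' => ['C'],
  'B' => ['C'],
  'C' => [],
  'D' => []
}.freeze

# Returns the changed cookbooks plus every cookbook that depends on one of
# them, directly or transitively.
def affected_cookbooks(dependencies, changed)
  affected = Set.new(changed)
  loop do
    # Find cookbooks not yet marked that depend on something already marked.
    newly = dependencies.keys.select do |name|
      !affected.include?(name) &&
        dependencies[name].any? { |dep| affected.include?(dep) }
    end
    break if newly.empty?
    affected.merge(newly)
  end
  affected.to_a.sort
end

p affected_cookbooks(DEPENDENCIES, ['C']) # A, B and C need testing; D does not
```

The loop repeats until no new cookbooks are found, so a cookbook that depends on A would also be picked up on the next pass.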

Leveraging the environment cookbook pattern to keep your tree whole

The environment cookbook pattern was introduced by Jamie Windsor in this post. The environment cookbook can be thought of as the trunk or root of your environment. You may have several top level cookbooks that have no direct dependencies on one another and therefore you are subject to the dependency islands referred to above. However if you have a common root declaring a dependency on all your top level cookbooks, you now have a single coherent graph that can represent all cookbooks.

The environment cookbook pattern prescribes the inclusion of an additional cookbook for every environment to which you want to apply this versioning rigor. This cookbook is called the environment cookbook and includes only four files:

README.md

Providing thorough documentation of the cookbook’s usage.

metadata.rb

Includes one dependency for each top level cookbook in your environment.

Berksfile and Berksfile.lock

These express the canonical dependency graph of your chef environment. Jamie suggests that this is the only Berksfile.lock you need to keep in source control. While I agree it’s the only one that “needs” to be in source control, I do see value in keeping the others. I think by keeping “child” Berksfile.lock files in sync the top level dependencies may fluctuate less often and provide a bit more stability during development.

Generating cookbook constraints against an environment cookbook

Some will suggest using berks apply in the environment cookbook and pointing it to the environment you want to constrain. I personally do not like this method because it simply uploads the constraints to the environment on the chef server. I want to generate the environment file locally first, where I can run tests against it and put it under version control.

At CenturyLink Cloud we have steps in our CI pipeline that I believe not only add the correct constraints but also allow us to identify all cookbooks impacted by the constraints and ensure that all impacted cookbooks are then tested against the exact same set of dependencies. Here is the flow we are currently using:

Generating new cookbook versions for changed cookbooks

As included in the safety guidelines above, this not only means that cookbooks with changed code get a version bump, it also means that any cookbook that takes a dependency on one of these changed cookbooks also gets a bump. Please refer to my last post which describes the version numbering strategy. This is a three step process:

  1. Initial versioning of all cookbooks in the environment. This results in all directly changed cookbooks getting bumped.
  2. Sync all individual Berksfile.lock files. This will effectively change the Berksfile.lock files of all dependent cookbooks.
  3. A second versioning pass that ensures that all cookbooks affected by the Berksfile.lock updates also get a version bump.

Generate master list of cookbook constraints against the environment cookbook

Using the Berksfile of the environment cookbook, we will apply the new cookbook versions to a test environment file:

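require 'json'
require 'berkshelf'

# Pins every cookbook in the environment cookbook's dependency graph to its
# locked version inside the given environment file, then returns the graph.
# Note: config is defined elsewhere in the build code and is not shown here.
def constrain_environment(environment_name, cookbook_name)
  dependencies = environment_dependencies(cookbook_name)
  env_file = File.join(config.environment_directory,
    "#{environment_name}.json")
  content = JSON.parse(File.read(env_file))
  content['cookbook_versions'] = {}
  dependencies.each do |dep|
    content['cookbook_versions'][dep.name] = dep.locked_version.to_s
  end

  File.open(env_file, 'w') do |out|
    out << JSON.pretty_generate(content)
  end
  dependencies
end

# Returns the locked dependencies listed by the Berksfile of the given cookbook.
def environment_dependencies(cookbook_name)
  berks_name = File.join(config.cookbook_directory,
    cookbook_name, "Berksfile")
  berksfile = Berkshelf::Berksfile.from_file(berks_name)
  berksfile.list
end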

This will result in a test.json environment file getting all of the cookbook constraints for the environment. Another positive byproduct of this code is that it will force a build failure in the event of version conflicts.

It is very possible that one cookbook will declare a dependency with an explicit version while another cookbook declares the same cookbook dependency but with a different version constraint. In these cases the Berkshelf list command invoked above will fail because it cannot satisfy both constraints. It's good that it fails now so you can align the versions before the final constraint is locked, rather than risking a version conflict during a chef-client run on a node.
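Here is a minimal sketch of what such a conflict looks like. It does not use Berkshelf's resolver. It simply checks, with the Gem::Requirement class from RubyGems as an assumed stand-in, whether any available version satisfies every declared constraint. The satisfying_versions helper, the constraints and the version list are hypothetical.

```ruby
require 'rubygems' # Gem::Version and Gem::Requirement ship with Ruby

# Hypothetical helper: the available versions that satisfy every constraint.
def satisfying_versions(constraints, available)
  requirements = constraints.map { |c| Gem::Requirement.new(c) }
  available.select do |v|
    requirements.all? { |req| req.satisfied_by?(Gem::Version.new(v)) }
  end
end

# One cookbook pins the dependency exactly, another asks for a newer series.
constraints = ['= 1.2.0', '~> 1.3']
available   = %w[1.2.0 1.3.1 1.4.0]

candidates = satisfying_versions(constraints, available)
if candidates.empty?
  puts 'conflict: no version satisfies every constraint'
else
  puts "resolved to #{candidates.last}"
end
```

No single version can be both exactly 1.2.0 and at least 1.3, so the candidate list is empty and the build should stop here.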

Run kitchen tests for impacted cookbooks against the test.json environment

How do we identify the impacted cookbooks? Well as we saw above, every cookbook that was either directly changed or impacted via a transitive dependency got a version bump. Therefore it’s a matter of comparing a cookbook’s new version to the version of the last known good tested environment. I've created an is_dirty function to determine if a cookbook needs to be tested:

# Returns true when the cookbook's locked version differs from the version
# recorded in the given environment file, or when no version is recorded.
def is_dirty(environment_name, cookbook_name, environment_cookbook)
  dependencies = environment_dependencies(environment_cookbook)
  cb_dependency = dependencies.find { |dep| dep.name == cookbook_name }

  env_file = File.join(config.environment_directory,
    "#{environment_name}.json")
  content = JSON.parse(File.read(env_file))

  # A cookbook without a recorded version has never been tested in this
  # environment, so it is treated as dirty.
  known_versions = content['cookbook_versions'] || {}
  return true unless known_versions.has_key?(cookbook_name)

  cb_dependency.locked_version.to_s != known_versions[cookbook_name]
end

This method takes the environment that represents my last known good environment (the one where all the tests passed), the cookbook to check for dirty status and the environment cookbook. If the cookbook is clean, it effectively passes this build step.
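The comparison at the heart of is_dirty can be sketched without Berkshelf at all. The snippet below is only an illustration, assuming the locked versions are already available as a plain hash. The name dirty_cookbooks and its arguments are hypothetical and not part of our build code. Unlike is_dirty above, this sketch treats an environment file with no cookbook_versions key as entirely untested.

```ruby
require "json"

# Hypothetical helper: returns the names of cookbooks whose locked version
# differs from (or is missing in) the last known good environment.
#   locked         - hash of cookbook name => locked version string
#   last_good_json - contents of the last known good environment file
def dirty_cookbooks(locked, last_good_json)
  tested = JSON.parse(last_good_json).fetch("cookbook_versions", {})
  locked.select { |name, version| tested[name] != version }.keys.sort
end

last_good = '{"cookbook_versions": {"platform_haproxy": "1.0.36", "newrelic": "2.0.0"}}'
locked = {
  "platform_haproxy" => "1.0.37", # bumped since the last good run
  "newrelic"         => "2.0.0",  # unchanged
  "platform_win"     => "1.0.71"  # never tested before
}

puts dirty_cookbooks(locked, last_good).inspect
```

Only the cookbooks returned here would need a kitchen run. Everything else already passed at the same version.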

In a future post I may go into detail regarding how we utilize our build server to run all of these tests concurrently from the same git commit and aggregate the results into a single master integration result.

Create a new Last Known Good environment

If any test fails, the entire build fails and it all stops for further investigation. If all tests pass, we run through the above constrain_environment method again to produce the final cookbook constraints of our Last Known Good environment. This environment serves as a release candidate of cookbooks that can converge our canary deployment group. The deployment process is a topic for a separate post.

The Kitchen-Environment provisioner driver

One problem we hit early on was that when test-kitchen generated the Berksfile dependencies to ship to the test instance, the versions it generated could differ from the versions in the environment file. This was because Test-Kitchen’s chef-zero driver, like most of the other chef provisioner drivers, runs a berks vendor against the Berksfile of the individual cookbook under test. This may produce different versions than a berks vendor against the environment cookbook, which also illustrates why we are following this pattern. When this happens, the individual cookbook on its own runs with a different dependency than it may on a chef server.

What we needed was a way for the provisioner to run berks vendor against the environment cookbook. The following custom provisioner driver does just this.

require "kitchen/provisioner/chef_zero"

module Kitchen

  module Provisioner

    class Environment < ChefZero

      def create_sandbox
        super
        prepare_environment_dependencies
      end

      private

      def prepare_environment_dependencies
          tmp_env = "TMP_ENV"
          path = File.join(tmpbooks_dir, tmp_env)
          env_berksfile = File.expand_path(
            "../#{config[:environment_cookbook]}/Berksfile", 
            config[:kitchen_root])    

          info("Vendoring environment cookbook")    
          ::Berkshelf.set_format :null

          Kitchen.mutex.synchronize do
             Berkshelf::Berksfile.from_file(env_berksfile).vendor(path)

              # we do this because the vendoring converts metadata.rb
              # to json. any subsequent berks command on the 
              # vendored cookbook will fail
              FileUtils.rm_rf Dir.glob("#{path}/**/Berksfile*")

              Dir.glob(File.join(tmpbooks_dir, "*")).each do | dir |
                cookbook = File.basename(dir)
                if cookbook != tmp_env
                  env_cookbook = File.join(path, cookbook)
                  if File.exist?(env_cookbook)
                    debug("copying #{env_cookbook} to #{dir}")
                    FileUtils.copy_entry(env_cookbook, dir)
                  end
                end
              end
              FileUtils.rm_rf(path)
          end
      end

      def tmpbooks_dir
        File.join(sandbox_path, "cookbooks")
      end

    end
  end
end

Environments as cohesive units

This is all about treating any chef environment as a cohesive unit wherein any change introduced must be considered upon all parts involved. One may find this to be overly rigid or change averse. One belief I have regarding continuous deployment is that in order to be fluid and nimble, you must have rigor. There is no harm in a high bar for build success as long as it is all automated and therefore easy to apply the rigor. Having a test framework that guides us toward success is what can separate a continuous deployment pipeline from a continuous hot fix fire drill.

Using git to version stamp chef artifacts by Matt Wrock

This post is not about using git for source control. It assumes that you are already doing that. What I am going to discuss is a version numbering strategy that leverages the git log. The benefit here is the guarantee that any change in the artifact (cookbook, environment, data bag) will result in a unique version number that will not conflict with other versions provided by your fellow teammates. It ensures that deciding on what version to stamp your change is one thing you don’t need to think about. I'll close the post demonstrating how this can be automated as a part of your build process.

The strategy explained

You can use the git log command to list all commits applied to a directory or individual file in your repository:

git log --pretty=oneline some_directory/

This will list all commits within the some_directory directory in a single line per commit that prints the sha1 and the commit comment. To make this a version, you would count these lines:

powershell:
(git log --pretty=oneline some_directory/).count

bash:
git log --pretty=oneline some_directory/ | wc -l

Semantic versioning

If you are using semantic versioning to express version numbers, the commit count can be used to produce the build number – the third element of a version. So what about the major and minor numbers? One argument you can pass to git log is a starting ref from which to list commits. When you decide to increment the major or minor build number, you want to tag your repository with those numbers:

git tag 2.3

So now you want your build numbers to reset to 0 starting from the commit being tagged. You can do this by telling git’s log command to list all commits from that tag forward like so:

git log 2.3.. --pretty=oneline some_directory/ 

If you were to run this just after tagging your repo with the major and minor versions, you would get 0 commits and thus the semantic version would be 2.3.0. So you will need to give thought to incrementing major and minor build number but the final element just happens upon commit.
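To make the arithmetic concrete, here is a rough Ruby sketch of the same idea. The helper names (git_log_command, commit_count, semantic_version, version_for) are made up for illustration. The sketch assumes git is on the path and that the tag name is exactly the major.minor string.

```ruby
require "open3"

# Builds the git log command described above. since_tag is optional;
# when given, only commits made after that tag are listed.
def git_log_command(path, since_tag = nil)
  cmd = ["git", "log", "--pretty=oneline"]
  cmd << "#{since_tag}.." if since_tag
  cmd + ["--", path]
end

# Counts the commits in --pretty=oneline output (one commit per line).
def commit_count(log_output)
  log_output.lines.count { |line| !line.strip.empty? }
end

# Combines a major.minor tag with the commit count to form the version.
def semantic_version(tag, log_output)
  "#{tag}.#{commit_count(log_output)}"
end

# Runs git for real; needs a working copy with a .git folder.
def version_for(path, tag)
  output, status = Open3.capture2(*git_log_command(path, tag))
  raise "git log failed" unless status.success?
  semantic_version(tag, output)
end

# Right after tagging there are no commits yet, so the build number is 0.
puts semantic_version("2.3", "")
```

Keeping the counting separate from the call to git makes the version logic easy to check without a repository at hand.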

Benefits and downsides to this strategy

Before getting into the details of applying this to chef artifacts, let's briefly examine some pros and cons of this technique.

Upsides

Any change to a versionable artifact will result in a unique build number and if two builds have the same contents, their build numbers will be the same

This is crucial especially if you need to communicate with customers or fellow team members regarding features or bugs. This can help to remove confusion and ensure you are discussing the same build. If you are using a bug tracking system, you will want to include this version in the bug report so other team members reviewing the bug can checkout that version from source control or review all changes made since that version was committed.

Builds can be produced independently of a separate build server

Especially for solo/side projects where you may not even have a build server, this can help you create deterministic build numbers. However even if your project’s authoritative builds are produced by a system like Jenkins or TeamCity, individual team members can produce their own builds and produce the same build numbers generated by your build server (assuming the build server is using this strategy). Of course the number may vary slightly if other team members have produced commits and have not yet pushed to your shared remote or if the build is performed without pulling the latest changes. That’s why you also want to include the current sha1 somewhere in your artifact. More on that later.

Allows you to separately version different artifacts in your repository

Especially if your chef repository houses multiple cookbooks and you freeze your cookbook versions or use version constraints in your environments, this can be very important. I want to know that any change to a cookbook will increment the version and if the cookbook has remained unchanged, its version should be the same.

Downsides

There will be gaps in your build numbers

You will likely commit several times between builds. So two subsequent builds with say 5 commits in between will increment the build number by 5. This should not be an issue as long as your team is aware of this. However, if you consider sequential build numbers important as a customer facing means to communicate change, this could be an issue. I have used this technique on a couple of fairly popular OSS projects and I never had an issue with users or contributors stumbling on this.

Build numbers can get big

If you rarely increment the major or minor build numbers, this will surely happen over time. I try to increment the minor number on any feature enhancing release in which case this is not usually an issue.

If build agents cannot talk to git

If you are using a centralized build server, and if this is a collaborative project you certainly should be, you definitely want the builds produced by your build server to follow this same strategy. In order to do that, you want to configure your build server to delegate the git pull to the build agents. Otherwise, the git log commands will not work. The build agent must have an actual git repo with the .git folder available to see the commit counts.

Applying this to chef artifacts

First, what do I mean by “chef artifacts?” Don’t I really mean cookbooks? No. While cookbooks are certainly included and are the most important artifact to version, I also want to version environment and data_bag files. If I used roles, I would version those too. Regardless of the fact that cookbooks are the only entity that has first class versioning support on a chef server, I should be able to pin these artifacts to their specific git commit. Also, I may change environment or data_bag files several times before uploading to the server and I may want to choose a specific version to upload. If you add cookbook version constraints to your environments, any dependency change will result in a version bump to your environment and your environment version may serve as a top level repository version.

Stamping the artifact

So what gets stamped where? For cookbooks this is obvious. The version string in metadata.rb will have the generated version applied. For environment and data_bag files, we create a new json element in the document:

{
  "name": "test",
  "chef_type": "environment",
  "json_class": "Chef::Environment",
  "override_attributes": {
    "environment_parent": "QA",
    "version": "1.0.24",
    "sha1": "c53bdaa92d67bea151928cdff10a8d5e634ec880"
  },
  "cookbook_versions": {
    "apt": "2.6.0",
    "build-essential": "2.0.6",
    "chef-client": "3.7.0",
    "chef_handler": "1.1.6",
    "clc_library": "1.0.20",
    "cron": "1.5.0",
    "curl": "2.0.0",
    "dmg": "2.2.0",
    "git": "4.0.2",
    "java": "1.28.0",
    "logrotate": "1.7.0",
    "ms_dotnet4": "1.0.2",
    "newrelic": "2.0.0",
    "platform_couchbase": "1.0.31",
    "platform_elasticsearch": "1.0.40",
    "platform_environment": "1.0.1",
    "platform_haproxy": "1.0.36",
    "platform_keepalived": "1.0.4",
    "platform_octopus": "1.0.13",
    "platform_rabbitmq": "1.0.33",
    "platform_win": "1.0.71",
    "provisioner": "1.0.209",
    "queryme": "1.0.2",
    "runit": "1.5.10",
    "windows": "1.34.2",
    "yum": "3.3.2",
   "yum-epel": "0.5.1"
  }
}

I add the version as an override attribute since you cannot add new top level keys to environment files. However for data_bag files I do insert the version as a top level json key.
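A minimal sketch of the stamping step might look like the following. The function names are hypothetical. Placing the sha1 next to the version at the top level of a data bag item is an assumption based on the description above, not a copy of our actual code.

```ruby
require "json"

# Environments: new top level keys are not allowed, so the version and
# sha1 are merged into override_attributes instead.
def stamp_environment(env, version, sha1)
  overrides = env.fetch("override_attributes", {})
  env.merge(
    "override_attributes" => overrides.merge("version" => version, "sha1" => sha1)
  )
end

# Data bag items: the stamp goes in as top level keys (assumed layout).
def stamp_data_bag_item(item, version, sha1)
  item.merge("version" => version, "sha1" => sha1)
end

env = {
  "name" => "test",
  "chef_type" => "environment",
  "override_attributes" => { "environment_parent" => "QA" }
}

stamped = stamp_environment(env, "1.0.24", "c53bdaa92d67bea151928cdff10a8d5e634ec880")
puts JSON.pretty_generate(stamped)
```

Both helpers return a new hash and leave the input untouched, so the same parsed file can be compared before and after stamping to decide whether it needs to be rewritten.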

Including the sha1

You may have noticed that the environment file displayed above has a sha1 attribute just below the version. Every commit in git is identified by a sha1 hash that uniquely identifies it. While the version number is a human readable form of expressing changes and can still be used to find the specific commit in git that produced the version, having the sha1 included with the version makes it much easier to track down the specific git commit. I can simply do a:

git checkout <sha1>

This will update my working directory to match all code exactly as it was when that version was committed. If you report problems with a cookbook and can give me this sha1, I can bring up its exact code in seconds.

As we have already seen, the sha1 is stored in a separate json attribute for environment and data_bag files. For a cookbook's metadata.rb file, I add this as a comment to the end of the file:

name        'platform_haproxy'
maintainer  'CenturyLink Cloud'
license     'All rights reserved'
description 'Installs/Configures haproxy for platform'
version     '1.0.36'

depends     'platform_keepalived'
depends     'newrelic'
#sha1 'c53bdaa92d67bea151928cdff10a8d5e634ec880'

Bringing all of this together with automation

At CenturyLink Cloud, we are using this strategy for our own chef versioning. I have been working on a separate “promote” gem that oversees our delivery pipeline of chef artifacts. This gem exposes rake tasks that handle the versioning discussed in this post as well as the process of constraining cookbook versions in various qa and production environments and uploading these artifacts to the correct chef server. The rake tasks tie in to our CI server so that the entire rollout is automated and auditable. I’ll likely share different aspects of this gem in separate posts. It is not currently open source, but I can certainly share snippets here to give you an idea of how this generally works.

Our Rakefile loads in the tasks from this gem like so:

config = Promote::Config.new({
  :repo_root => TOPDIR,
  :node_name => 'versioner',
  :client_key => File.join(TOPDIR, ENV['versioner_key']),
  :chef_server_url => ENV['server_url']
  })
Promote::RakeTasks.new(config)
task :version_chef => [
  'Promote:version_cookbooks', 
  'Promote:version_environments', 
  'Promote:version_data_bags'
]

So rake version_chef will stamp all of the necessary artifacts with their appropriate version and sha1. The code for versioning an individual cookbook looks like this:

def version_cookbook(cookbook_name)
  dir = File.join(config.cookbook_directory, cookbook_name)
  cookbook_name = File.basename(dir)
  version = version_number(current_tag, dir)
  metadata_file = File.join(dir, "metadata.rb")
  metadata_content = File.read(metadata_file)
  version_line = metadata_content[/^\s*version\s.*$/]
  current_version = version_line[/('|").*("|')/].gsub(/('|")/,"")

  if current_version != version
    metadata_content = metadata_content.gsub(current_version, version)
    outdata = metadata_content.gsub(/#sha1.*$/, "#sha1 '#{sha1}'")
    if outdata[/#sha1.*$/].nil?
      outdata += "#sha1 '#{sha1}'"
    end
    File.open(metadata_file, 'w') do |out|
      out << outdata
    end
    return { 
      :cookbook => cookbook_name, 
      :version => version, 
      :sha1 => sha1}
  end
end

def version_number(current_tag, ref)
  all = git.log(10000).object(ref).between(current_tag.sha).size
  bumps = git.log(10000).object(ref).between(current_tag.sha).grep(
    "CI:versioning chef artifacts").size
  commit_count = all - bumps
  "#{current_tag.name}.#{commit_count}"
end

This uses the git ruby gem to interact with git and plops in the version and sha1 into metadata.rb. Note that we exclude all commits labeled “CI:versioning chef artifacts.” After our CI server runs this task, it commits and pushes the changes back to git. We don’t want to include this commit in our versioning. We also adjust our CI version control trigger to filter out this commit from commits that can initiate a build; otherwise we would end up in an infinite loop of builds.
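The effect of excluding the CI commits can be illustrated on a plain list of commit messages. This is only a sketch with a hypothetical build_number helper. It stands in for the all - bumps arithmetic in version_number and does not touch git.

```ruby
CI_MARKER = "CI:versioning chef artifacts"

# Counts the commits that should contribute to the build number,
# skipping the automated version stamping commits.
def build_number(commit_messages, marker = CI_MARKER)
  commit_messages.count { |message| !message.include?(marker) }
end

messages = [
  "add retry to haproxy reload",
  "CI:versioning chef artifacts",
  "fix keepalived priority attribute"
]

# Three commits since the tag, one of them from the CI server.
puts "1.0.#{build_number(messages)}"
```

Without the filter, every build would bump the version once for the real change and once more for the stamping commit itself.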

Adding a Berkshelf sync

After we generate the new versions but before we push the versions back to git we want to sync up our Berksfile.lock files so we run this:

cookbooks = Dir.glob(File.join(config.cookbook_directory, "*"))
cookbooks.each do |cookbook|
  berks_name = File.join(
    config.cookbook_directory, 
    File.basename(cookbook), 
    "Berksfile")
  if File.exist?(berks_name)
    Berkshelf.set_format :null
    berksfile = Berkshelf::Berksfile.from_file(berks_name)
    berksfile.install
  end
end

This ensures that the CI commit includes up to date Berksfile.lock files that may very well have changed due to the version changes in cookbooks that depend on one another. This will also be necessary in generating the environment cookbook constraints but that will be covered in a future post.

Thoughts?

I realize this is not how most version their chef artifacts or non chef artifacts for that matter. I know many folks use knife spork bump. You can certainly leverage spork with this strategy as well but just provide the git generated version instead of letting spork auto increment. This versioning strategy has proven itself to be very convenient for me on non chef projects. I’d be curious to get feedback from others on this technique. Any obvious or subtle pitfalls you see?

Hurry up and wait! Tales from the desk of an automation engineer by Matt Wrock

I have never liked the title of my blog: “Matt Wrock’s software development blog.” Boring! Sure it says what it is but that’s no fun and not really my style. So the other week I was taking a walk in my former home town of San Francisco and it suddenly dawned on me “Hurry up and wait.” The clouds opened up, doves descended and a black horse crossed my path and broke the 9th seal…then the dove pooped on my shoulder which distracted me and I went on about my day. Later I recalled the original epiphany and decided to purchase the domain which I did last night and point it to this blog. I love the phrase. It immediately strikes the incongruous tone of an oxymoron but the careful observer quickly sees that it is actually sadly true and I think this truth is particularly poignant to one who spends large amounts of time in automation like myself.

A brief note on the .io TLD. Why did I register under .io? Well besides the fact the .com/.org were taken by domain parkers, my understanding is that the .io will allow my content to be more accessible to the young and hip. Why just look at my profile pic to the right and then switch to mattwrock.com and look again. The nose hairs are way sexier on the .io site right?! Sorry ladies but I’m taken.

I digress..so before the press releases, media events and other fan fare that will inevitably follow this “rebranding” of my blog, I thought I’d take some time to reflect on this phrase and why I think it resonates with my career and favorite hobby.

What do you mean “wait”? Isn’t that contrary to automation?

Technically yes, but a quote from Star Trek comes to mind here: “The needs of the many outweigh the needs of the few.” It is the automation engineer who takes one for the team to sacrifice their own productivity so that others may have a better experience. Yes, always putting others before themselves is the way of the automation engineer. There is no ego here. There is no ‘I’ in automation. Oh wait…well…it’s a big word and there is only one at the end and it’s basically silent.

I could go on and on about this virtuous path (obviously) but at least my own experience has been that making things faster and removing those tedious steps takes a lot of effort, trial and error, testing and retesting, reverse engineering and can incur quite a bit of frustration. The fact of the matter is that automation often involves getting technology to behave in a way that is contrary to the original design and intent of the systems being automated. It may also mean using applications contrary to their “supported” path. In many scenarios we are really on our own and traveling upstream.

So much time, so little code

I’ve been writing code professionally for the past fifteen years. I’ve been focusing on automation for about the last three. One thing I have noticed is that emerging from solving a big problem I often have much less code to show for myself than I would in more “traditional” software problems. Most of the effort involves just figuring out HOW to do something. There may be no or extremely scant documentation covering what we are trying to accomplish. A lot of the work involves the use of packet filters, and other tooling that can trace file activity, registry access or process activity and lots and lots of desperate deep sea google diving where we come up for air empty. When all is said and done we may have just a small script or simply a set of registry keys.

Congratulations! You have automated the thing! Now can you automate it again?

This heading could also be entitled so little code, so much testing but this one’s a bit more upbeat I think.

Another area that demands a lot of time from this corner of the engineering community is testing. Much of the whole point of what we do is about taking some small bit of code and packaging it up in such a way that is easily accessible and repeatable. This means testing it, reverting it (or not) and testing it again. The second point, reverting it (or not), is absolutely key. We need to know that it can be repeatedly successful from a known state and sometimes that it can be repeatable in the “post automation” state without failing. The fancy way to describe the latter is idempotence.

Maybe I’m actually lucky enough to solve my problem quickly. I high five my coworker Drew Miller (@halfogre) but then he refuses to engage in the chest bumping and butt gyrating victory dance which to me seems a most natural celebration of this event. But alas…I wave good bye to sweet productivity as I wait for my clean windows image to restore itself, test it, watch it fail due to some transient network error, add the necessary retry logic and then watch it fail again because it can’t be executed again in the state it left the machine. So there goes the rest of that day…

Why bother?

Good question. I have often asked myself the same. The obvious answer is that while automating a task may take 100x longer than the unautomated task itself, we are really saving ourselves and often many others from the death of 1000 cuts. The mathematicians out there will note the difference between the “100” and “1000” and correctly observe that one is ten times the other. Of course this ratio can fluctuate wildly and yes there are times when the effort to automate will never pay off. It is important, and sometimes very difficult, to recognize those cases especially for those ADD/OCD impaired like myself.

I have seen large teams under management that rewarded simply “getting product out the door” with the unfortunate byproduct of discouraging time devoted to engineering efficiencies. This is a very slippery slope. It starts off with a process that garners a bit of friction but through years of neglect becomes a behemoth of soul sucking drudgery inflicted on hundreds of developers as they struggle to build, run and promote their code through the development life cycle. Even sadder is that often those who have been around the longest and with the most clout have grown numb to the pain. Like a frog bathing in a pot of water slowly drawn to a boil, they fail to see their impending death, and they don’t understand outsiders that criticize their backwards processes. They explain it away as being too big or complex to be simplified. Then when it finally becomes obvious that the system must be “fixed” it is a huge undertaking involving many committees and lots of meetings. MMmmmmmmm…committees and meetings…my favorite.

The best part is its magic

But beyond the extreme negative case for automation portrayed above, there is a huge up side. I personally find it immensely rewarding to take a task that has been the source of much pain and suffering and watch the friction vanish. There is a magical sensation that comes when you press a button or type a simple command and then watch a series of operations unfold and then unfold again that once took a day to set up. I won’t go into details on the sensation itself, this is after all a family blog.

Want to keep the best and recruit the best? Automate!

In the end, everyone wants to be productive. There is nothing worse than spending half a day or more fighting with build systems, virtualized environments and application setup as opposed to actually developing new features. Eventually teams inundated with these experiences find themselves fighting turnover and it becomes difficult to recruit quality talent. Who wants to work like this?

Wait a second…haven't I just been describing my job in a similar light? Fighting systems not meant to be automated and getting little perceived bang for my coding buck? Kind of, but these are actually two distinctly different pictures. One is trapped in an unfortunate destiny and the other assumes command over destiny at an earlier stage of suffering in efforts to banish it.

So what are you waiting for?…Hurry up and wait!

Getting Ready…Troubleshooting unattended windows installation by Matt Wrock

[Screenshot: the Windows “Getting Ready” screen]

I install windows (and linux) A LOT in my role at CenturyLink Cloud automating our infrastructure rollout and management. Sometimes things go wrong. Usually if our provisioning code has been waiting for more than a few minutes for the machine to be reachable I know something is not right. So I might pop open a VMWare console and see this ever familiar screen. The windows installation is “Getting Ready.” That may fill one with the adrenaline of sweet anticipation but I know this only ends in disappointment. I can assure you that if windows is not ready now, it will never be ready. As in never, ever, ever ready.

In the past I have sat staring into the spinning circle of emptiness wondering what in god’s name is windows doing. There are no error messages and usually nothing helpful in the VMWare events other than telling me that the OS customization has failed. Mmm…thanks. Sometimes after 5 or 15 minutes, the OS may come to life but often not in a state that our provisioning can connect to over winrm. I’m usually caught off guard by this since I have been spending the past several minutes in a very intense Vulcan mind meld with my monitor. Hoping somehow to break through and thinking I’m just beginning to feel the silent, cold, lonely suffering of a failed domain join when suddenly I am asked to press ctrl+alt+delete. Well…ok…I will…and slowly, as if just awoken from one of those inception dreams within a dream within another dream and having aged hundreds of years, I type just that – ctrl+alt+delete.

OK. You got me. Ctrl+Alt+Del does not work in a VMWare console, but you get the idea. Anyhoo, I next run off to the event logs reading lots and lots of events that are entirely unhelpful and provide no clues. Usually this all ends up being some stupid error like providing a faulty domain admin password to the unattend file. Not too long ago we added code to our windows provisioning that adds a second NIC and that introduced a few issues leading to this phenomenon until I got the sequence just right of adding the NIC, disabling it, configuring it and enabling it. But a couple weeks ago I ran into a new issue that really stumped me and that I was not able to solve by looking over my provisioning code or configuration data. This prompted me to research how to get to the bottom of what's going on when Windows is “Getting Ready.” In this post I will cover what I learned and hopefully reveal clues that can help others figure out how to get out of these installation hang-ups.

Overview of CenturyLink Cloud’s server provisioning sequence

It may help to point out roughly how we go about installing our windows boxes. Our methods may be different from yours but that should be irrelevant and the techniques here for troubleshooting windows installation hangs and errors should be just as applicable to just about any unattended windows install. Our windows servers do run server 2012 R2 so older OSs may certainly be different.

We have been using chef for our server automation and, in particular, Chef-Metal for our provisioning process. We have written a custom Chef-Metal Vsphere driver that leverages the RBVMOMI ruby library to interact with the VMWare VSphere API that does all the footwork of going to the right host, cloning an initial VM template, hooking up the right data stores, setting up initial networking etc. This also calls into VMWare’s guest OS customization configuration which will produce a windows unattend.xml file, also known as an answer file. The VMWare tools will inject this file into the setup which windows will then use to drive its installation.

Our unattend file ends up being pretty simple. It performs a domain join and runs some scripts that tweak winrm so our provisioner can talk to the machine, install the Chef client and kick-off the appropriate cookbooks and recipes making the machine a “real boy” in the end. We run a mix of windows and linux and everything goes through this same sequence. Of course the linux boxes don’t have unattend.xml files generated, but they do have their own OS customization process that configures initial networking.

If everything goes right, this takes about 5 minutes from the initial cloning until the machine can receive network traffic and begin its convergence to whatever role that machine will fill: web server, rabbitMQ server, CouchDB server, etc. It really doesn't matter if it's windows or linux, 5 minutes is roughly the norm. BTW: for most of our automation testing of linux machines we use Docker which is nearly instantaneous but we do not use that in production (yet).

Breaking through Getting Ready

So what can one do when the windows install gets “stuck” in this Getting Ready state? Shift-F10 is your friend. I don’t think it matters what hypervisor infrastructure you are using or even if this is a bare metal install. We use VMWare but this should work on Hyper-V, VirtualBox, etc. Shift-F10 will immediately open a CMD.exe as administrator if typed during the unattended install phase.

From here you can start poring through logs and can even open regedit and other gui based tools if necessary but this command prompt is usually enough to find out what is happening.

Where are the logs?

As I have stated above, I have personally not found the VMWare events or the machine event logs to be much help. Your mileage may vary but you are likely going to want to find the unattend activity log which is located, of course, in

c:\windows\panther\UnattendGC\setupact.log

I don’t know what Panther is. I like to think there was some MS windows team back in the early 90’s that called themselves the panther team pioneering the way forward in windows automation. I also like to think they used gang-like panther calls to communicate with one another when spotting each other in the cafeteria or the campus store. They may have worn special jackets with the wild face of a panther on the back and perhaps some had tattoos or some form of tribal scarification applied resembling panther like imagery. Who knows…I can only guess.

At least in my case this is where the answers were found. Certainly they will be here if the issue is related to the domain join which mine usually tend to be. If the authentication with the domain admin account is at fault, that should be clear here. For instance:

2014-09-06 22:30:10, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:20, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:30, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:40, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:51, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:01, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:11, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:22, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:32, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:42, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...

The key above is the hex error code. Given the nature of the hexadecimal numeric format, the root cause is often immediately obvious, and if not, a Google search usually points you to a more specific message.
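If the log is long, it can help to boil the retries down to the distinct error codes before searching for them. Here is a minimal sketch in Python, assuming you have copied setupact.log to a machine that has Python installed (a fresh windows install will not). The function names and the regular expression are my own and are modeled only on the log lines shown above. The decimal form of the code can be handed to the built-in `net helpmsg` command; 0x775 is 1909, which windows documents as ERROR_ACCOUNT_LOCKED_OUT.

```python
import re
from collections import Counter

# Matches DJOIN.EXE failure lines like the ones quoted above and captures
# the hex error code. The pattern is an assumption based on those lines only.
FAILURE = re.compile(r"\[DJOIN\.EXE\].*attempt failed: (0x[0-9A-Fa-f]+)")


def summarize_join_failures(lines):
    """Count how often each hex error code appears in DJOIN.EXE failure lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def to_decimal(hex_code):
    """Convert a code such as '0x775' to the decimal number `net helpmsg` expects."""
    return int(hex_code, 16)


# Two lines taken from the excerpt above, plus a filler string that is not
# a real log entry and exists only to show that non-matching lines are skipped.
sample = [
    "2014-09-06 22:30:10, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...",
    "2014-09-06 22:30:20, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...",
    "(filler line that is not a join failure)",
]

for code, seen in summarize_join_failures(sample).items():
    print(f"{code} seen {seen} time(s); try: net helpmsg {to_decimal(code)}")
```

To run it against a real file, pass an open file handle instead of the sample list, for example `summarize_join_failures(open(path, errors="replace"))`, where `path` points at your copy of the log.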

In the scenario that recently stumped me, the issue was that the domain controller could not be found. It turned out that although I was explicitly giving the domain controller IPs as the DNS servers to use, I was assigning the machine IP via DHCP and the DHCP server pointed to a different pair of DNS servers. For whatever reason, windows was choosing to use those servers and was therefore unable to resolve the domain name to its correct domain controllers. There are also many other details unrelated to the domain join to be found in this log.

Other log locations that may be helpful

If, for whatever reason, the unattend activity log does not have helpful information, there are a few more places to look. Check all files and subdirectories under:

c:\windows\panther
c:\windows\debug
c:\windows\temp

If you too are using the VMware tools to drive the OS customization, you will find logs specific to VMware’s work in c:\windows\temp. Many of the logs in the directories mentioned above may duplicate one another but some may have more granular detail than others.
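Since these directories can hold a lot of files, here is a rough sketch of how one might sweep all three for a keyword rather than opening each log by hand. It is Python again, so it assumes Python is available on the machine or that the directories have been copied somewhere it is. The `find_mentions` helper, the list of file extensions and the "failed" search term are my own choices for illustration. It also assumes the logs are in an 8-bit encoding, so any UTF-16 files would need different handling.

```python
import os


def find_mentions(roots, needle, extensions=(".log", ".txt", ".xml")):
    """Yield (path, line_number, line) for every line containing needle, ignoring case."""
    needle = needle.lower()
    for root in roots:
        # os.walk silently yields nothing for a directory that does not exist.
        for folder, _dirs, files in os.walk(root):
            for name in files:
                if not name.lower().endswith(extensions):
                    continue
                path = os.path.join(folder, name)
                try:
                    with open(path, errors="replace") as handle:
                        for number, line in enumerate(handle, start=1):
                            if needle in line.lower():
                                yield path, number, line.rstrip()
                except OSError:
                    continue  # skip files that are locked or unreadable


if __name__ == "__main__":
    # The three locations mentioned above.
    roots = [r"c:\windows\panther", r"c:\windows\debug", r"c:\windows\temp"]
    for path, number, line in find_mentions(roots, "failed"):
        print(f"{path}:{number}: {line}")
```

Swapping "failed" for a specific hex code such as 0x775 narrows the output to just the lines that mention that error.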

I certainly hope this helps. If it does and you happen to spot me in a crowd, let out a wild panther shriek and I promise to return the same.