Detecting a hung windows process / by Matt Wrock

A couple years ago, I was adding remoting capabilities to boxstarter so that one could setup any machine in their network and not just their local environment. As part of this effort, I ran all MSI installs via a scheduled task because some installers (mainly from Microsoft) would pull down bits via windows update and that always fails when run via a remote session. Sure the vast majority of MSI installers will not need a scheduled task but it was just easier to run all through a scheduled task instead of maintaining a whitelist.

I have run across a couple installers that were not friendly to being run silently in the background. They would throw up some dialog prompting the user for input and then that installer would hang forever until I forcibly killed it. I wanted a way to automatically tell when "nothing" was happening as opposed to a installer actually chugging away doing stuff.

Its all in the memory usage

This has to be more than a simple timeout based solution  because there are legitimate installs that can take hours. One that jumps to mind rhymes with shequel shurver. We have to be able to tell if the process is actually doing anything without the luxury of human eyes staring at a progress bar. The solution I came up with and am going to demonstrate here has performed very reliable for me and uses a couple memory counters to track when a process is truly idle for long periods of time.

Before I go into further detail, lets look at the code that polls the current memory usage of a process:

function Get-ChildProcessMemoryUsage {
    param(
        $ID=$PID,
        [int]$res=0
    )
    Get-WmiObject -Class Win32_Process -Filter "ParentProcessID=$ID" | % { 
        if($_.ProcessID -ne $null) {
            $proc = Get-Process -ID $_.ProcessID -ErrorAction SilentlyContinue
            $res += $proc.PrivateMemorySize + $proc.WorkingSet
            $res += (Get-ChildProcessMemoryUsage $_.ProcessID $res)
        }
    }
    $res
}

Private Memory and Working Set

You will note that the memory usage "snapshot" I take is the sum of PrivateMemorySize and WorkingSet. Such a sum on its own really makes no sense so lets see what that means. Private memory is the total number of bytes that the process is occupying in windows and WorkingSet is the number of bytes that are not paged to disk. So WorkingSet will always be a subset of PrivateMemory, why add them?

I don't really care at all about determining how much memory the process is consuming in a single point in time. What I do care about is how much memory is ACTIVE relative to a point in time just before or after each of these snapshots. The idea is that if these numbers are changing, things are happening. I tried JUST looking at private memory and JUST looking at working set and I got lots of false positive hangs - the counts would remain the same over fairly long periods of time but the install was progressing.

So after experimenting with several different memory counters (there are a few others), I found that if I looked at BOTH of these counts together, I could more reliably use them to detect when a process is "stuck."

Watching the entire tree

You will notice in the above code that the Get-ChildProcessMemoryUsage function recursively tallies up the memory usage counts for the process in question and all child processes and their children, etc. This is because the initial installer process tracked by my program often launches one or more subprocesses that do various bits of work. If I only watch the initial root process, I will get false hang detections again because that process may do nothing for long periods of time as it waits on its child processes.

Measuring the gaps between changes

So we have seen how to get the individual snapshots of memory being used by a tree of processes. As stated before, these are only useful in relationships between other snapshots. If the snapshots fluctuate frequently, then we believe things are happening and we should wait. However if we get a long string where nothing happens, we have reason to believe we are stuck. The longer this string, the more likely our stuckness is a true hang.

Here are memory counts where memory counts of processes do not change:

VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 0
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 1
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 2

These are 3 snapshots of the child processes of the 2012 Sql Installer on an Azure instance being installed by Boxstarter via Powershell remoting. The memory usage is the same for all processes so if this persists, we are likely in a hung state.

So what is long enough?

Good question!

Having played and monitored this for quite some time, I have come up with 120 seconds as my threshold - thats 2 minutes for those not paying attention. I think quite often that number can be smaller but I am willing to error conservatively here. Here is the code that looks for a run of inactivity:

function Test-TaskTimeout($waitProc, $idleTimeout) {
    if($memUsageStack -eq $null){
        $script:memUsageStack=New-Object -TypeName System.Collections.Stack
    }
    if($idleTimeout -gt 0){
        $lastMemUsageCount=Get-ChildProcessMemoryUsage $waitProc.ID
        $memUsageStack.Push($lastMemUsageCount)
        if($lastMemUsageCount -eq 0 -or (($memUsageStack.ToArray() | ? { $_ -ne $lastMemUsageCount }) -ne $null)){
            $memUsageStack.Clear()
        }
        if($memUsageStack.Count -gt $idleTimeout){
            KillTree $waitProc.ID
            throw "TASK:`r`n$command`r`n`r`nIs likely in a hung state."
        }
    }
    Start-Sleep -Second 1
}

This creates a stack of these memory snapshots. If the last snapshot is identical to the one just captured, we add the one just captured to the stack. We keep doing this until one of two thing happens:

  1. A snapshot is captured that varies from the last one recorded in the stack. At this point we clear the stack and continue.
  2. The number of snapshots in the stack exceeds our limit. Here we throw an error - we believe we are hung.

I hope you have found this information to be helpful and I hope that none of your processes become hung!