Extract TFS Pending Changes to a zip file by Matt Wrock

Our TFS server was down today and I needed to get a Shelveset to a tester. Playing with the Power Tools PowerShell CmdLets I was able to basically pipe my pending changes to a zip file and give that to the tester.

Getting the Power Tools PowerShell Cmdlets

If you have Visual Studio Update 1 or Update 2, you can download the TFS Power Tools from the Visual Studio Gallery here. If you have an older version of Visual Studio, download the Power Tools from the Microsoft Download Center here.

When you install the MSI, make sure to opt in to the PowerShell Cmdlets. The default options will not include the PowerShell Cmdlets!

Accessing the Cmdlets

After you have installed the Power Tools, you can launch the Power Tools PowerShell console from the start menu item created during install. However, if you are like me and have your own shell preferences, simply import the module into your own shell:

Import-Module "${env:ProgramFiles(x86)}\Microsoft Team Foundation Server 2012 Power Tools\Microsoft.TeamFoundation.PowerTools.PowerShell.dll"

You may want to add this to your PowerShell Profile.
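If you do want it in your profile, something like this works (a minimal sketch; it assumes the default 2012 Power Tools install path shown above):

# Create the profile file if it does not exist, then append the import so the cmdlets load in every session
if (!(Test-Path $PROFILE)) { New-Item $PROFILE -ItemType File -Force | Out-Null }
Add-Content $PROFILE 'Import-Module "${env:ProgramFiles(x86)}\Microsoft Team Foundation Server 2012 Power Tools\Microsoft.TeamFoundation.PowerTools.PowerShell.dll"'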

Extracting Pending Changes

Run the following commands from the root of your workspace to pipe all pending changes to a zip.

Add-Type -AssemblyName System.IO.Compression.FileSystem,System.IO.Compression
$here = (Get-Item .).FullName
$archive = [System.IO.Compression.ZipFile]::Open(
  (Join-Path $here "archive.zip"), [System.IO.Compression.ZipArchiveMode]::Create)
Get-TfsPendingChange | % {
  [System.IO.Compression.ZipFileExtensions]::CreateEntryFromFile(
    $archive, $_.LocalItem, $_.LocalItem.Substring($here.Length+1)
  )
}
$archive.Dispose()

This simply adds the .NET types to the PowerShell session and uses them to create a ZipArchive, then calls Get-TfsPendingChange to get a list of all files with pending changes. Each file is added to a zip file, which in this case is called archive.zip and located in your current folder.

Requires .NET 4.5 and PowerShell 3

The code above makes use of the new, friendlier .NET API for creating zip files. This will not work if you have .NET 4.0 or lower. Also, since PowerShell versions prior to 3.0 use the .NET 2.0 runtime, they will not be able to load .NET 4.5 types. PowerShell 3.0 comes installed automatically on Windows 8 and Server 2012. On Windows 7 or Server 2008 R2, you can download and install the Windows Management Framework 3.0 here to get PowerShell 3.0. You can get the .NET 4.5 runtime here.
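If you are not sure what you are running, a quick check from your shell (nothing assumed beyond a working console):

$PSVersionTable   # PSVersion should report 3.0 or higher and CLRVersion 4.0.x or higher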

Easily Script Machine Reinstalls with Boxstarter by Matt Wrock

Almost a year ago now I started this small project, Boxstarter.  The project has sustained many interruptions as I have had busy spurts at work and involvements with other projects. It has gone through some serious refactorings moving from a scrappy script to a larger codebase organized into a few PowerShell modules and a suite of unit tests. A lot of its original code I moved over to Chocolatey and I plan to continue to contribute to Chocolatey where it makes sense. Chocolatey Rocks! So overall, I have probably rewritten this thing twice and now I feel it is far from finished but represents a nice base that makes “Box Starting” a much simpler and repeatable process.

Repaving Sucks

I think that pretty much says it. The idea of creating Boxstarter came to me when I had an SSD die and had to do a dreaded repaving of my machine, shortly followed by installing several Windows 8 revisions from Consumer Preview to RTM. At the time, I was in the middle of a project where I was using PowerShell to automate a 50 page deployment document. So I knew if I could automate installing AppFabric, network shares, multiple web apps and other infrastructure settings, surely I could script the build of my own PC.

Then I found Chocolatey

So as I was looking into how I could set up a fully functioning dev environment on one box to be just as I left it on another, I inevitably discovered Chocolatey. Chocolatey is built on top of NuGet, but instead of maintaining library packages for your dev projects, it manages machine-wide software package installations. This is good for several reasons:

  • It's plain simple to install apps that can be tedious to install on your own. Instead of hunting around the internet for the download page, forgetting to uncheck the anime toolbar download and waiting three minutes to click the next button, just type CINST <your app> and be done with it. Next time it's time for a mega Visual Studio install session, save yourself and use CINST VisualStudioExpress2012Web.
  • Now let's say you have a bunch of apps you installed with Chocolatey and want to just update everything. Simply type CUP ALL.
  • The very best thing of all: create a “meta-package” or packages.config and now you can install all of your apps in one go (see the sketch just after this list). Chocolatey just iterates the list and installs everything one by one along with all of their dependencies.
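Roughly, a packages.config is just a small XML file listing package ids (the ids below are only examples), and you hand the file to cinst from the folder that contains it:

<?xml version="1.0" encoding="utf-8"?>
<packages>
  <package id="GoogleChrome" />
  <package id="notepadplusplus" />
  <package id="7zip" />
</packages>

cinst packages.config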

If you have not heard of or have not used Chocolatey, do yourself a favor and install it now.

What is Boxstarter? Chocolatey Tailored Specifically for Fresh Machine Installs

Chocolatey is awesome, but having done a TON of experimentation with automating new machine setups of all sorts of flavors, OSs and complexity, I have learned that setting up an environment can be much more than simply running a chain of installers.

Let me quickly list the benefits of Boxstarter and then I’ll dive into a few highlights:

  • Ensures the entire install session runs as administrator. This avoids occasional prompts to elevate your shell and limits it to just one prompt at the beginning, assuming you are not already running as admin.
  • Shuts down the Windows Update service and Configuration Manager (if installed) during the install session. These can often interfere with installations, causing installs to fail because either an update is blocking the install you are trying to run or they install patches that require a reboot before other software can be installed.
  • Can detect a pending reboot and, if asked to, will reboot the machine and restart the install. If written correctly, the install will pretty much pick up where it left off. Further, Boxstarter can automatically log you back in so you don't have to stick around.
  • Boxstarter handles the initial installation of Chocolatey and, if you are on a fresh Windows 7 or Server 2008 R2 machine, it will install .NET 4.0 first, which is a Chocolatey prerequisite.
  • Provides a bunch of helper functions for tweaking various Windows settings.
  • Automates installation of critical Windows updates.
  • Makes it easy to set up a local Boxstarter repo on your network so any connected machine can kick off an install with one command.
  • Provides helper functions making it easy to create your own Boxstarter package.

The Boxstarter Killer Feature: Handling Reboots

I used to spend hours tweaking my install scripts, playing with ordering and various tricks to avoid having to reboot. There finally came a point when I realized this was pointless. Win8/Server 2012 are a lot more resistant to reboots but are still prone to them. Things get worse when you are installing patches and more complicated apps like Visual Studio and/or SQL Server. I have realized that reboots happen and can be unpredictable, so the only thing to do is be able to deal with them.

The challenges are making sure the install script picks up right after restart, ensuring that the script does not spark a UAC prompt and block the setup, and having it securely store your credentials so that it automatically logs back on after a reboot but turns off auto logins after the script completes.

Boxstarter does all of these things. As a Boxstarter package author, you simply need to compose your packages to be repeatable. This means you should be able to run it again and again without error or data loss and ideally it should skip any setup processes that have already been run.
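For example (illustrative only, not code from a real Boxstarter package), a repeatable step guards its own work so a rerun after a reboot just skips what is already done:

# Safe to run any number of times: only create the folder if it is not already there
$tools = "$env:SystemDrive\tools"
if (!(Test-Path $tools)) { New-Item $tools -ItemType Directory | Out-Null }
# Likewise, only run an installer if its install folder is missing ("SomeApp" is a made-up name)
if (!(Test-Path "$env:ProgramFiles\SomeApp")) { cinst SomeApp }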

What is a Boxstarter Package?

It's just a Chocolatey package, but its intent is usually to either install a fresh environment or to lay down a complicated install chain that is highly prone to needing one or even several reboots. You can store them locally, on the public Chocolatey feed, on MyGet, or anywhere else you configure Boxstarter to look.

Show me the Code

First, install Boxstarter. The easiest way to do this is to install Boxstarter.Chocolatey from Chocolatey, or download the zip from the CodePlex site and run the setup.bat. This installs all dependent modules and puts them in your user module path.

Next, create a package, build it and deploy your repository to be consumed from anywhere in your network or even a thumb drive. Like this:

#After extracting Boxstarter.1.0.0.zip on MYCOMPUTER
.\setup.bat
Import-Module $env:appdata\boxstarter\Boxstarter.Chocolatey\Boxstarter.Chocolatey.psd1
#Create minimal nuspec and chocolateyInstall
New-BoxstarterPackage MyPackage
#Edit Install script to customize your environment
Notepad (Join-Path $Boxstarter.LocalRepo "tools\ChocolateyInstall.ps1")
#Pack nupkg
Invoke-BoxstarterBuild MyPackage

#share Repo
Set-BoxstarterShare
#Or Copy to thumb drive G
Copy-Item $Boxstarter.BaseDir G:\ -Recurse

#Logon to your bare Windows install
\\MYCOMPUTER\Boxstarter\Boxstarter Mypackage

#Enter password when prompted and come back later to find all your apps installed


Now let's look at what an install package might look like:

Install-WindowsUpdate -AcceptEula
Update-ExecutionPolicy Unrestricted
Move-LibraryDirectory "Personal" "$env:UserProfile\skydrive\documents"
Set-ExplorerOptions -showHidenFilesFoldersDrives -showProtectedOSFiles -showFileExtensions
Set-TaskbarSmall
Enable-RemoteDesktop

cinstm VisualStudioExpress2012Web
cinstm fiddler
cinstm mssqlserver2012express
cinstm git-credential-winstore
cinstm console-devel
cinstm skydrive
cinstm poshgit
cinstm windbg

cinst Microsoft-Hyper-V-All -source windowsFeatures
cinst IIS-WebServerRole -source windowsfeatures
cinst IIS-HttpCompressionDynamic -source windowsfeatures
cinst IIS-ManagementScriptingTools -source windowsfeatures
cinst IIS-WindowsAuthentication -source windowsfeatures
cinst TelnetClient -source windowsFeatures

Install-ChocolateyPinnedTaskBarItem "$env:windir\system32\mstsc.exe"
Install-ChocolateyPinnedTaskBarItem "$env:programfiles\console\console.exe"

copy-item (Join-Path (Get-PackageRoot($MyInvocation)) 'console.xml') -Force $env:appdata\console\console.xml

Install-ChocolateyVsixPackage xunit http://visualstudiogallery.msdn.microsoft.com/463c5987-f82b-46c8-a97e-b1cde42b9099/file/66837/1/xunit.runner.visualstudio.vsix
Install-ChocolateyVsixPackage autowrocktestable http://visualstudiogallery.msdn.microsoft.com/ea3a37c9-1c76-4628-803e-b10a109e7943/file/73131/1/AutoWrockTestable.vsix


What's going on here?

Boxstarter installs critical updates, sets your PowerShell execution policy to unrestricted, makes Windows Explorer usable, installs some great apps, installs some of your favorite Windows features, moves your Documents library to SkyDrive (I love this for the truly portable desktop), installs your favorite VS extensions and sets up things like pinned items and taskbar size preference.

A lot of this functionality comes to you compliments of Chocolatey, and other parts are specific to Boxstarter.

What’s Next?

As I see it, this is just the absolute base functionality so far. There is so much more to be added to make an installation process truly seamless. Here are some features I plan to begin soon:

  • Create a Boxstarter package automatically based on software already installed and Windows features turned on, to mimic these on another machine.
  • While Boxstarter can be installed and run either on bare metal or a VM, I want to make Boxstarter the PowerShell/Hyper-V equivalent of Vagrant, making the deployment of scripted VMs simple.
  • Add a one-click installer, making an easy one-liner install command possible remotely.

There is a lot of Boxstarter functionality I have not covered here. I plan to blog fairly regularly, providing brief posts describing various ways Boxstarter can augment your automation. If you want to learn more now, check out the Boxstarter CodePlex site, which has complete documentation of all features and commands.

Is changing an API or design solely for testability a good practice? by Matt Wrock

There is a personal story behind this topic that I want to share. About five years ago I heard about this thing called Test Driven Development (TDD). For anyone unaware of this, it is where you write failing tests first and then write “the code” later to make them pass. It immediately struck me as interesting and the more I learned about it the more I liked it. It seemed like much more than a means of testing code but a design style that could transform the way we write software.

As I actually began to practice this discipline, it did completely change the way I write code and approach design. It has been very challenging and rewarding. I don’t at all consider myself polished or advanced but I have drunk the Kool-Aid and hey, who doesn’t like Kool-Aid!? I cover more specifics later but please, let me reminisce.

At the time when I was discovering this technique, I purposely sought out a team of developers who were TDD practitioners. As someone who is self taught and prefers to learn on my own, this was an area I knew I needed to learn from others to better get my head around the patterns. So for almost three years I was immersed with my new team. We often discussed the virtues of strong unit test coverage and TDD. When you roll with a group that all share a common set of values, it is easy to feel “right” and take some nuances for granted. Even if they are right.

Maybe the world IS flat

Six months ago I changed teams. The new team was different from my former team. One difference was a lack of unit testing. There were lots of tests but they were essentially integration tests and took a long time to run. So long that they could not be run as part of the build. Tests were always written after coding and one reason why was that the code itself was nearly impossible to test using typical Unit Testing techniques.

One of my initial thoughts going in was that this would be an excellent opportunity to make a difference for the better. I still believe this, but I was very surprised at how some of my techniques and the patterns that I had come to embrace were questioned and given some rather skeptical critiques. Suddenly it felt like the things I had come to value in standards of design and build methodologies were a currency not honored in this foreign land.

Now it would have been one thing if I were working for some mediocre outfit of 9-to-5er developers. However, many of these were people I consider to be very smart and passionate about writing good software. It was like having a bunch of Harvard grads insisting the earth is flat and watching yourself begin to question if the world really is round. Sure does look flat when you really look at it.

Not everyone likes chocolate and some who don’t are smart

Sidebar: I just learned today that Jonathan Wanagel, who runs Codeplex and is one of the smartest people I know in the whole wide world, does not care much for chocolate. Interesting…

What is this Hippie code?

Here is a list of many of the things that turned some people off:

  • Interfaces – adds too many types to the API, kills Visual Studio's F12 navigation and is overkill when there is only a single production implementation.
  • Unsealing classes, adding virtual methods or making some private/internal types public – now we have to support a larger surface area and make sure customers do not get confused or shoot themselves in the foot.
  • Eliminating static classes – now I need to instantiate a class to access a simple utility or helper method.
  • Test methods with long method names resembling a sentence describing what is being tested – that's just weird.
  • A general concern expressed by many - We should not have to change code and especially an API just to add testability.

Some of these concerns are very valid. Honestly they are all valid. Even though I completely disagree with some of the opinions I have encountered, it must be remembered that introducing new ideas will be naturally distasteful to some if they do not understand the intent and it is therefore incumbent upon the bringer of the new ideas to clarify and articulate why such ideas might have value. I’m not claiming “mission accomplished” but this forced me to do a lot of “back to basics” research to understand how to communicate the value. While I feel the value deep in my bones, it has been very challenging to learn how to express the value to others.

Let's play Devil's Advocate

Before I dive into the responses I have developed and am still developing for the above reactions to a different style of design I would like to spend some time defending the critics and skeptics. I work with very intelligent people and I respect the fact that they demand to understand why someone suggests a change in style and a radical change in some respects.

Personally I spent more than half my career not practicing TDD or writing proper unit tests. I too was very conservative about what my API exposed. I liked to write APIs with few classes. I liked “noun” style classes (employee, order, etc) with their own persistence logic and static Create methods. I thought this was elegant and a lot of others think so too. And honestly, I wrote some code that made some people a lot of money during this time.

Sealed classes, internal methods, oh boy!

Among the crowd I run with on the twitters, there is a lot of naysaying around sealed classes and internal methods. Don't expect me to be changing my Add New Class template to create them sealed by default, but I do think this deserves some thought. My background is largely in web technologies. This often means that my server code will never be exposed to anyone outside of my team. It is easy to adopt much looser rules around API exposure when you are the only consumer.

I am also involved in open source software. We are a small cadre of developers and we are not representative of the average developer. I know. That sounds elitist but it is true. I'm used to reading source code in lieu of documentation (not necessarily a good thing). I'm perfectly comfortable pulling down someone else's code using any of a half dozen source control providers. And like many of my OSS peeps, I get extremely annoyed when I am trying to work with an API, find some code that looks to be just what I need and then see it is not accessible. I'd rather have a larger API that provides me heightened flexibility and extensibility over a smaller and easier to understand API – within reason of course.

However, when you work on software that physically ships to enterprise customers, you really do need to broaden your view. I might work for Microsoft, but if you call me an enterprise customer, I’ll cry. The fact is, it is important to understand your audience. Most developers don’t want to concern themselves with the innards of your code. That’s why they buy it. It should do everything they need it to do and be easy to figure out and difficult to misuse. No matter who your audience is, the public API is one of the most important things to get right. It should read like documentation and be self explanatory. Sometimes this means putting a curtain over large parts of your codebase. I’m still coming to terms with this, but I do believe it is a reality that deserves attention.

Testability for its own sake – That’s just fine

I used to feel an uneasiness when discussions would take this turn into warnings about the dangers of making code testable simply for its own sake. I would feel a sense of guilt about asking others to work harder just to make things easier to test. I’m over that. This is like arguing against quality for its own sake or simplicity just to be simple. If I have to tweak an API to make it testable, we have to remember that we are not only doing ourselves a favor but everyone who will be using our software can now test around it as well.

While TDD and similar practices have been fairly mainstream in other communities for a while, it is becoming more so in the Microsoft dominated technologies (where I work) and much more so than it was just a few years ago. We need to understand that testability ships as a feature to our customers. Software that is difficult to test is not just perceived as a nuisance but its overall quality can be called into question by virtue of a lack of testability.

So what is so great about testability?

Perhaps I've gone far too long in this post before championing the virtues of testability, not to mention giving some clear examples of what it is. I can hear others questioning: what are you talking about? We have QA staff and huge suites of test automation. Testability? Let's not get carried away.

Yes. Testability is very much about testing the code that you have written. Specifically, I am referring to the ability to test small single units of code (an IF block for example) without the side effects of surrounding code. I might have a method that queries a database for a value, sets the state of the application depending on the value queried and logs the result. I want to be able to write tests that can check that I set the application state appropriately but not that I got the right data from the database or that I successfully logged the activity. I’ll write other tests for that.
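A rough sketch of the idea, in PowerShell rather than C# and with entirely made-up function shapes: the state-setting logic takes its collaborators as parameters, so a test can hand it trivial stand-ins instead of a real database and logger.

function Set-AppState {
    param(
        [scriptblock]$GetValue,  # in production this would wrap the database query
        [scriptblock]$Log        # in production this would write to the real log
    )
    $value = & $GetValue
    $state = if ($value -gt 10) { 'Active' } else { 'Idle' }
    & $Log "State set to $state"
    return $state
}

# A "unit test" that exercises only the state logic; no database or log file is touched
$state = Set-AppState -GetValue { 42 } -Log { }
if ($state -ne 'Active') { throw 'Expected Active for values greater than 10' }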

Having a code base well covered with these kinds of tests can create (but does not guarantee) a very high quality bar and allow developers to spend more time writing features and less time finding and fixing bugs. The more logical paths your code can take, the more important this is. It is easy to innocently introduce a small change and inadvertently break several logic paths. This level of code coverage is your safety net from bug whack-a-mole. Each release without this coverage brings increased surface area for bug creation until the cyclomatic complexity drives you to a saturation point and now you spend most of your time addressing bugs.

Good test coverage and code that is easy to test invites exploration and experimentation. It is the fence that keeps us from touching the third rail. Low coverage strikes fear into the hearts and minds of good developers. “Hmm. That sounds like a really interesting approach but we don't dare touch that code because it is core to our business and we can't afford for it to break.”

Testable code is more about good design than a test automation arsenal

This is something that can sound odd to the unindoctrinated. At least it was not what originally attracted me to TDD but it IS what has kept me here.

In order to write code that is easy, let alone possible, to test with this kind of granularity, one must enforce a strict separation of concerns because you want to test no more than a single concern at a time. This may produce code that looks different to many teams and might look awkward at first glance. The code is more likely to have these traits:

  • More types. More classes, more methods and more interfaces, as each type is given a highly focused and intent-driven purpose. Rather than having a person class that not only represents the person but also does a bunch of stuff to and with a person, one may now have several classes that look more like verbs than nouns to represent different interactions with a person.
  • Smaller types and smaller methods. This goes hand in hand with the above. It does not take a genius (that's why I figured it out) to discover that it is tough to test a method that does 20 things. And guess what? It is easier to read and understand too.
  • More layers of abstraction and points of extensibility. This may coincide with the mention of more interfaces. As you tease out corners of code to test, you need to be able to apply protective tape over surrounding machinery that should remain untouched by the test. This may be because these surrounding areas talk to out-of-process systems that would bog down the performance of a test or engage in complex logic that manipulates values that must interact with the code under test. It is easier to “plug in” lighter-weight machinery that acts on data very predictably, repeatably and quickly. The use of interfaces, dependency injection and mocks/stubs/fakes/etc. come into play here and may make one not used to them feel out of their element or like they are over-engineering. One may react that this abstraction seems silly. Why create an INamingService when we only have one naming service? First, that is a fair point and should not be ignored. It is possible to over-engineer and you need to decide what level of abstraction your scenario calls for. That said, once you gain competence coding in this manner, you often find ways to exploit these abstractions into rich composition models that would not have been possible given a more monolithic class structure.
  • A larger surface area to the API. This is what many find the hardest to come to terms with and they should. There is A LOT to be said for a simple API and testability does not necessarily make this fate an inevitability. However it does make it more likely. With an application having more “building blocks” there may be a greater number of these blocks to interact with one another and exposing those interactions to the consumer may very well be a good thing. Also, just like you, your consumers writing code around your API may demand testability and the ability to abstract away all exposed blocks.

This design style facilitates change, improvement and happy developers

Code that is easy to compose in different ways is easier to change. This kind of a model allows you to touch smaller and more isolated pieces of your infrastructure, making change a less risky and dreaded endeavor. A team that is empowered to change and improve their code more rapidly makes for a happier team, happier business stake holders and happier customers.

Use modern tools for modern programming techniques

Ok. I'm gonna call TDD (because that's usually what this methodology becomes more or less) a “modern programming technique.” There are tools out there that are designed to make these practices easier to implement. For a primarily C# developer like me, these tools include ReSharper, an IoC container for dependency injection, a good mocking framework and a modern test runner like xUnit.

ReSharper makes a lot of the refactorings, like extracting interfaces, easier to perform and makes interface implementations and their usages easier to discover. It provides navigational aids that make working in a code base with more types much easier.

An IoC container facilitates the creation of these types and makes it easier to manage either swapping out one type for another or discovering all implementations of a given type. One may find constructors with several types being injected, which would be extremely awkward to “new up.” With a properly wired IoC container, it handles the newing for you. Almost all IoC containers (don't create your own) provide solid lifetime management as well, which provides the utility of singleton classes without their untestability.

The wrong question

So let's return to the core question of this post. Is changing an API or design solely for testability a good practice? I would argue that we have not properly phrased the question. Begin by exploring your definition of “testability” and you may well discover that the “sake of testability” is not nearly all that you are after.

BigGit: Git for large and tall TFS Repositories by Matt Wrock

I work in Microsoft's Developer Division (devdiv). We are the ones that make, among other things, Visual Studio. Our source control is in a TFS Repository (we make TFS too). A very large TFS repository and likely the largest in existence. I really don't know how big exactly. I do know that the portion I work with is about 13GB and spread over dozens of workspace mappings. Before moving to this part of Microsoft, I had been a Git and Mercurial user for the past three years and learned to love them. My move to DVCS was more than an upgrade in tooling, it was a paradigm shift that transformed the way I work. I knew I would use TFS moving to devdiv but just figured I would use a bridge like git-tfs. After moving to my new TFS home and spending some time intentionally becoming familiar with TFS, I decided to break out git-tfs and get back my branching and merging goodness. It didn't work. Then I became sad and then desperate. As I progressed through the various Kubler-Ross stages of grieving, I never did reach the final stage of Acceptance. This is the story of my journey that led me to create BigGit.

My hope for the reader

My hope for you, dear reader, is that this will be just a story and that you will have no personal need for BigGit. Unless you are a coworker, chances are you won't. But I do know there are others out there working in unwieldy TFS repos and wanting but unable to use git. Chances are if you are dealing with a repository this large, you are likely experiencing friction points that transcend just source control tooling. You may come home each night feeling like your nipples have been rubbing against heavy-grade sandpaper all day long. Unless you are one who likes that kind of thing, and I mean no disrespect if you do, hopefully you can apply some BigGit and soothe that chafing.

If you are interested in reading more about my personal experiences and impressions as I migrated from traditional centralized version control to DVCS, please read on. If you just want to quickly learn of a way you can possibly get your large TFS repo under git, then go to http://BigGit.codeplex.com and you will find lots of documentation on how it works and how to use it. I hope it helps.

My Introduction to DVCS

About three years ago, I reluctantly entered into the world of distributed source control. I had been using SVN and was very happy. It had only been a couple years since I had moved off of Visual Source Safe (VSS) so I was easily pleased. I was finally feeling competent with the basics of branching and merging, which were just bastardized and deformed children in VSS but first-class concepts in SVN. When my organization adopted Mercurial in 2011, it was clear to me that it was the right decision. It was not difficult to read the tea leaves even then and see that DVCS was clearly the future.

The first month of using Mercurial was painful. I not only had to learn new commands but I had to internalize a different perspective on source control. This was not SVN. Everyone had their own complete repo and not just a “working folder,” and then you had to deal with three-way merges that took a bit to grok. There were times when I wondered if it was worth it and wished I could have my familiar SVN back. However, I got the sense that I was working with something very powerful and if I could master it, I'd have a lot to gain.

I was not alone. In a team room with four developers, I probably ranked #2 in satisfaction. During our first month of ramp up, mistakes were made. The kind of mistakes most will make and that this post cannot keep you from. You will try to do something fancy because you can. Something like a partial roll back or cherry picking commits, and you will make some mistake that either deletes history or puts your repository into some weird and esoteric state that takes you hours to dig out of. You will want to blame the new stupid source control tool, and perhaps you will, to make yourself feel better.

Switching to the Command Line

As a Source Safe, SVN and early Mercurial user, I primarily used GUI tools to manage my source code. As DVCS systems go, Mercurial has a decent GUI. However, as I learned more about Mercurial and delved into troubleshooting adventures, I became more uncomfortable with the way that the GUI tools hid a lot of the operations from me. When I click the “sync” button, what really happens? After all, there is no Mercurial “sync” command. I roughly knew about the concepts of pull/fetch/update but the GUI hid the details from me. This was all well and good until something went wrong, and as you delve into the problem, you find that you really do need to understand these concepts if you are going to work in this system day in and day out. Perhaps the more casual user can cope with a GUI, but working in an active team of developers on various branches, I really felt a craving to understand how the system worked.
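For the curious, a GUI “sync” button is roughly some combination of these Mercurial commands (the exact sequence varies by tool):

hg pull              # fetch new changesets from the remote repository into your local repo
hg update            # bring your working directory up to date with what was pulled
hg commit -m "..."   # record your local changes in your local repo
hg push              # publish your local changesets to the remote repository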

So I opened up the command line, put the HG directory in my path and never looked back. I personally believe this was not only a practice that improved my Mercurial competence, but it was the gateway to command line competence, leading me to PowerShell and making me a better and more productive developer. And when we become better and more productive, we become happier.

When people new to DVCS would come to me with problems, they were almost always using a GUI exclusively and had bumped into something that they could possibly solve with the GUI but would not understand using the artificial and dumbed-down abstractions that the GUI enforces. In order to properly articulate the problem and solution, one needs to understand the commands and switches the GUI is calling. I always tell people to put down the GUI and then they look at me with crazy eyes. I love the crazy eyes. After their next couple blunders that we all make, I see that command line on their screen.

It's not so bad, the command line. You quickly find that most days you use no more than a half dozen commands anyway, that the GUI does not save much time, and that as you become more competent, you use the GUI where it shines and stick to the command line for the rest.

DVCS Bliss: Branching and merging and time traveling

So what made me become such a firm advocate of Mercurial and eventually Git? There are so many things and I’ll quickly list some here:

  • Most operations are faster because they are carried out locally. Commits, branch creation and switching, examining diffs and history – these are all local operations that do not need to make a server call.
  • Getting a repo up and running is trivial. When I worked in the MSDN org, the msdn/technet forums, profile pages, search code, gallery platform (Visual Studio Gallery, MSDN Samples, etc) and more were using Mercurial. The “central” repository that everyone pushed to and the build system pulled from was maintained by the development team. Being one of the primary administrators, I can safely say that I spent about 5 minutes a month doing some kind of administrative task and it never went down or suffered performance issues. This is due to its simplicity. It's just a file.
  • The local repository is portable. I can move it around or delete it and there is no “server” component that cares about or tracks its location.

These are nice benefits, but the biggest value to me of DVCS is the ease and speed of creating branches. This is another area where Git outshines Mercurial. While branches are just as fast and easy to create in both, you can easily delete them in Git while they can only be “archived” in Mercurial. Both Git and Mercurial support excellent compression algorithms and merge tracking strategies. Creating branches in either is incredibly lightweight from a disk space perspective. Since all branches are local, I can switch from branch to branch without an expensive server call to reassemble the branch. Using Mercurial or Git, having two 1GB branches that have only a few files of differing code will probably take up not much more than 1GB for both branches, while TFS requires me to keep a full copy of both.

The merging algorithms of Git and Mercurial are also better than either SVN or TFS. One of the things that delighted me almost immediately when I made the move from SVN to Mercurial was that merge conflicts dramatically decreased. I have had similar experiences with TFS and Git. In systems like TFS and SVN, you might think twice about branching because of the overhead and friction they can produce, while Git/Mercurial make it so easy that branching is often central to the workflow in these systems. It is very empowering to know that at any moment I can create a branch and do a bunch of experimentation. Later I can discard the branch or merge the changes back to my original branch. It is also easy to share branches. In a single command I can “push” a branch to a central repository where my fellow contributors can pull it and view my changes. Heck, you don't even need a central repo; you can push/pull peer to peer.
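If you have not experienced this workflow yet, the whole throwaway-branch dance is just a handful of commands (nothing here is specific to this post's setup, and the branch name is arbitrary):

git checkout -b experiment    # create and switch to a local branch in one step
# ...hack away and commit as often as you like...
git push origin experiment    # optionally share the branch through a central repo
git checkout master           # switching back is instant; no server round trip
git merge experiment          # fold the work into master if the experiment panned out
git branch -d experiment      # then clean the branch up (use -D to discard unmerged work)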

Git and TFS interoperability

Clearly git has become widely popular among the greater developer community. Many developers find themselves using TFS during the day and git in their off hour work. For those who work in organizations using TFS, there are tools like git-tfs or git-tf that allow one to work in a git repository and bridge changes back and forth between git and tfs. This allows one to target a TFS server path and clone it to a git repository. They can do their day to day work in the git repo and then “checkin” to TFS periodically. These “bridging” technologies typically work great and make the transitions between git and tfs fairly transparent. I work on a couple small and isolated pockets of our TFS repository at work and I use git-tf to clone those areas of tfs to git. Once the directory is cloned, I am in a normal git repo. There are a lot of developers who work like this without a problem.

Large or dispersed repositories

There are a couple scenarios where this interoperability does not work so nicely. One is if your TFS workspace has a lot of mappings. Both git-tfs and git-tf can only clone a single TFS server folder. This is fine if all of your mappings fall under a single root folder that is of a manageable size. However, these multi mapping workspaces often have so many mappings precisely because the root is too large to map on its own. In my case, my 500GB hard drive would run out of space before I could finish syncing my root folder.

This touches on the second scenario preventing Git from playing nicely with large repositories. Now, large is a relative and subjective term. Let's qualify large as greater than 5GB. The raw size is not the key issue but rather the number of files. Many of Git's key operations perform a file-by-file scan of the repository to determine the status of each file. If you have hundreds of thousands of files, this can be a real bummer. Keep in mind that Git was built for the purpose of providing source control for the Linux kernel, which is roughly a couple hundred MB. This provides great performance for the vast majority of source repositories in the world today, but if you work in a repository like mine the perf can be abysmal.

Repository Organization

If you look at some large enterprise TFS or P4 repositories, it is not uncommon to find everything placed under a single repository. That works just fine for these systems and it can be easier and simpler to administer them this way. Because these are centralized version control systems, the server acts as the single authoritative keeper of the assets. This server copy is not a file system but a database. In a large enterprise model where teams of DBAs administer the corporate databases, it makes sense to keep everything in a single data store. 500GB is not a large database and putting everything in one repo simplifies a lot of the maintenance tasks. Most centralized systems provide a protocol where the user often specifies the set of files to be acted upon. By doing this, there is no need to do a file scan because you are telling the system which files will be affected.

Of course many Git operations can have individual files specified and many TFS commands can act recursively on an entire repository. However, key operations like git status, commit and blame need to determine the status of each file in the repo. And on the TFS side, users become trained NOT to do a TF GET . /r on the root of a 50GB folder. Do it once and trust me, you won't want to do it again.

Having worked under both models, I do think there is a lot more to be said for keeping a repository small than just improved source control performance. It limits the intellectual territory of concern. For someone like myself with severely limited intellectual resources, this really matters. Honestly, this has a tangible impact on our ability to grok source. Here are some scenarios to consider:

  • You need to find code related to a bug. I often find myself looking for a needle in a haystack unless I am very familiar with the nature of the bug and the related source. It is just too easy to end up with a less than tidy code structure with a single tree.
  • In this “shared code” folder, what do I care about? If a single repo is maintained under a large division, there might be one or a small number of TFS folders for shared assets like runtimes, dependent libraries, and build tools. Eventually these folders can become huge and often consume the bulk of space. Eventually no single person or team can keep track of everything in it. One might map to this folder but have no idea what they actually need or don’t. Eventually in TFS, one might create “cloaks” to hide particularly large subdirectories but still have a long tail of perhaps hundreds of small unused directories that consume gigabytes of space. Who wants to create hundreds of cloaks or a hundred individual mappings for just the directories I need?

Of course one can organize a TFS repository into logical subtrees and that’s great if you can. However, it is just too easy to cross intellectual boundaries and having several repositories is a nice forcing function to keep the assets organized. Even if this means some duplication, I think that is a perfectly acceptable compromise for being able to quickly pull the code I care about. Disk space is much cheaper than developer hours spent looking for code.

Some possible ways to work with large git repos that did not work for me

So months ago when I began my investigation into how to get my huge TFS workspace under git control, I came across a lot of advice to individuals in similar circumstances. This advice most commonly was one or a combination of the following.

Garbage Collect the Repo

Think of this as a defrag tool for Git. Over the course of working in a Git repository, objects will become orphaned as they are dereferenced. For example, let's say you want to discard the last several commits with a git reset --hard, or you delete a branch. These objects do not disappear immediately. It may be weeks before Git reclaims unreferenced objects. So it is possible that you have a repository with a bunch of stuff that will never be used again and can be discarded. If you want to invoke the Git garbage collector immediately and trim away all unused objects, then call:

git gc --aggressive

This may take several minutes or even hours on a very large repository. It is possible that this can improve performance but if you find performance is poor right after a clone, it is unlikely that a GC is going to yield much of an improvement.

Sparse Checkout

This is somewhat new as of git 1.7. It allows you to clone a repository and then specify a part of the tree that you want to keep and git will eliminate the rest from the index and working tree. This may potentially solve some scenarios but it did not address my problem because I needed all 13GB of source.
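For reference, the rough shape of a sparse checkout looks like this (a sketch only; the repository URL and the folder are placeholders):

git clone --no-checkout <repo-url> myrepo   # clone without populating the working tree
cd myrepo
git config core.sparseCheckout true
# list the folders you want to keep, one per line, in .git/info/sparse-checkout
"src/TheFolderICareAbout/" | Out-File .git\info\sparse-checkout -Encoding ascii
git read-tree -mu HEAD                      # populate the working tree with just those paths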

SubModules

Submodules are separate git repositories that you can include in your repository. The intent here is for including libraries in your project that have their own git repository. You may be working on Project Foo that uses the Bar library. Bar has its own git repository that you want to contribute to and get changes from within your Foo repository. So you clone Bar into a subdirectory of Foo. This can work fine, especially for the scenario it was intended for.

I didn't want to have to maintain each second-level directory as a separate repository. This can be especially awkward when branching. Each submodule has to be branched separately. I can't just branch the top level and expect everything below to be branched. Submodules are really intended for working with logically separate libraries. Now, if my large repository was structured in such a way where I did my own work in one collection of subdirectories and another team worked exclusively in their own dedicated subdirectory, submodules might make more sense, but that was not the case for me.
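For completeness, the basic mechanics using the Foo/Bar example above (the URL is a placeholder):

# from inside the Foo repository: clone Bar into a subdirectory and record it as a submodule
git submodule add https://example.com/bar.git lib/Bar
git commit -m "Add the Bar library as a submodule"
# anyone cloning Foo later pulls the submodule contents down with:
git submodule update --init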

Splitting the repository

Now we come to the solution that I finally settled on. After spending a couple months trying different things with Git-Tf, git-tfs and an internal Microsoft tool, experimenting with various combinations of the above and not being satisfied with anything, I pretty much gave up trying to work in Git. I found I was wasting too much time “fighting” the system and needed to move on and actually get some work done. Then about two months ago I had a flash of insight into an idea I thought might work, and it did! This solution has one major assumption that must be true in order to work: a significant chunk of your TFS repository includes content that you do not regularly or directly contribute to and does not change frequently.

Well over 80% of my repository contains external libraries, build tools and native runtime bits. This is content that rarely changes and is not source that I contribute to or need to pull from regularly. Once I have the C runtime version X, I don't need to sync that every day. Part of this discovery emerged as I learned more and more about the nature and content of our repository. As a new member of the org, staring at a repo with millions of files, not to mention trying to understand a new product, I really had no idea what I needed and did not need. I was pointed to a set of workspace mappings that I was told was what I needed and worked with that. Over time, I became more familiar with various parts but large swaths were still a mystery to me. I took a weekend and, with the help of some Sysinternals tools like Process Monitor watching as I ran several builds, I learned what I needed to know: which bits my build actually uses and how it uses them.

Roughly this is what I did:

  • I created one TFS workspace and added all mappings from my original workspace that contained files my team does not contribute to and rarely change. I call this the Nonvolatile workspace.
  • I created another TFS workspace for the remainder of the mappings that I and the larger team actually churn on and need to sync regularly. I call this the Volatile workspace.
  • Ideally I’d like a git repository that tracked all files in the Volatile workspace but I have two problems: 1) Git-Tf and other bridging tools do not map to the mappings in a workspace but need to map to a single server path (like $/root/branch/path) and 2) I need the contents of both workspaces in one directory tree so that they can reference each other correctly. So the following steps take care of this.
  • I created a TFS “partial” branch in my own feature area of my Volatile mappings. This is a way to create a branch that branches just a selective number of folders from the parent folder. This is documented by my coworker Chandru Ramakrishnan, and you can find more examples of creating partial branches in the BigGit source, which creates these for you. This solves the first problem above and allows me to map a git repository to a single TFS folder $/root/branches/mybranch and get just my Volatile content.
  • I used Git-Tf to clone my partial branch folder to a new Git repository. However, I can't build from this since it is missing all the dependencies in my Nonvolatile workspace.
  • Next, for each top-level directory in my Nonvolatile workspace, I created a symbolic link in my git repository and added each of these directories to my .git\info\exclude file (see the sketch just after this list). This is the equivalent of the .gitignore file, but it is not publicly versioned and stays private to my own repo. This solves the second problem by putting all my content in one tree while git does not need to see the nonvolatile content.
  • Lastly, I want a git remote where I can push/pull all my git branches. I don't feel safe keeping it all on my local machine. Git-Tf only syncs the master branch with the TFS server. So I clone my git repository to a file server //someserver/share and make this my origin remote.
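The symbolic link and exclude step in particular is easy to script. Here is a rough sketch (the paths are placeholders for your own layout, not anything BigGit requires):

# link each top-level Nonvolatile folder into the git working tree and hide it from git
$gitRepo     = 'C:\dev\VolatileRepo'
$nonVolatile = 'C:\dev\NonVolatileWorkspace'
Get-ChildItem $nonVolatile | ?{ $_.PSIsContainer } | % {
    cmd /c mklink /D (Join-Path $gitRepo $_.Name) $_.FullName
    Add-Content (Join-Path $gitRepo '.git\info\exclude') "$($_.Name)/"
}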

Here is a diagram to illustrate this configuration:

[Diagram: BigGit configuration]

My volatile content that is now tracked by git is about 1.5 GB. Yes, a big repository but still very much manageable in git. This is great! Look at meeee! I am working in Git!! But things are still a bit awkward.

  1. That was a lot of steps to set this up. What if I want to do the same on another machine or help others with the same setup? I want to be able to repeat this without errors and without having to look up how I did it.
  2. Syncing changes between my Git repo and the parent TFS repository is a lot of work. I have to sync using Git-Tf with my partial branch and then do a TFS merge from my partial branch to the parent folder. If I want to get the latest changes from TFS I need to do a GET in the parent workspace, merge that to my partial branch, check the merge in and then use Git-Tf to pull my partial branch changes into git. Wow. Now I need a nap.

Automating away complexity with BigGit

BigGit is a PowerShell module I wrote to solve the setup and syncing of this style of repository layout. By importing this module into your PowerShell session you can either use several of BigGit's functions to craft your repository or, more likely, use the convenience Install-BigGit function to set up your repo in one step. You just need to know which mappings to direct to your volatile and nonvolatile workspaces and then BigGit will create everything for you. Even if your parent workspace is relatively small but requires a bunch of mappings, you can omit the nonvolatile mappings and BigGit will just set up all mappings as volatile and solve the multi-mapping problem. Here is an example of using BigGit to set up your repository:

$myWorkspace = "MyWorkspace;Matt Wrock"
$serverRoot = "/root/branch"
$tfsUrl = "http://tfsServer:8080/tfs"

$NonVolatileMappings = Invoke-TF workfold /collection:$tfsUrl /workspace:$myWorkspace | ?{
    $_.ToLower() -match "($serverRoot/external(/|:))|($serverRoot/build(/|:))|($serverRoot/lib(/|:))|($serverRoot/runtimes(/|:))"
}

$VolatileMappings = Invoke-TF workfold /collection:$tfsUrl /workspace:$myWorkspace | ?{
    $_.ToLower() -notmatch "($serverRoot/external(/|:))|($serverRoot/build(/|:))|($serverRoot/lib(/|:))|($serverRoot/runtimes(/|:))"
}

Install-BigGit -tfsUrl $tfsUrl `
               -serverPathToBranch "`$$serverRoot" `
               -partialBranchPath "`$/root/mypartialbranchlocation" `
               -localPath c:\dev\BigGit `
               -remoteOrigin \\myserver\gitshare `
               -NonVolatileMappings $NonVolatileMappings `
               -VolatileMappings $VolatileMappings

Here we compose our mappings into separate PowerShell variables and then use the Install-BigGit function to set everything up. Depending on the size of your original workspace, this could take hours to run, but it should be a one-time cost. Once this is done, you can sync your changes to TFS using one command that automates all the git-tf syncing and TFS merging discussed above:

Invoke-ReverseIntegration "some great changes here"

This DOES NOT do the final checkin to TFS. It stops at merging the changes into your parent TFS workspace. I thought it would be more desirable to let the user do the final TFS checkin and sanity check the files being checked in.

To pull in the latest changes from TFS use:

Invoke-ForwardIntegration

It is easy to install BigGit. While you are welcome to download or clone the source from codeplex (and contribute too!!), you may find it easier to download via Chocolatey:

iex ((new-object net.webclient).DownloadString('http://chocolatey.org/install.ps1'))
."$env:systemdrive\Chocolatey\chocolateyinstall\chocolatey.ps1" install BigGit

This will download just the module and modify your profile so that BigGit functions are always accessible.

If you want to find out more about BigGit, view the source code or discover the functions it exposes and read their command line documentation, you can find all of that at the codeplex project site. I hope this might help others that find themselves in the same predicament I was in. If you think that BigGit could stand some enhancements or that any analysis in this post is incorrect or lacking, I welcome both comments and pull requests.

Related reading

Why Perforce is more scalable than Git by Steve Hanov

Facebook’s benchmark tests using Git

Is Git recommended for large (>250GB) content repositories on StackOverflow

What to do when you can't access SQL Server as Admin by Matt Wrock

This happens to me on a particular VM setup framework we use at work about every couple months and I always have to spend several minutes looking it up. Well no longer I say. I shall henceforth document these steps so I will never have to wander the internets again for this answer.

Stop-Service mssqlserver -Force
& 'C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\sqlservr.exe' -m

sqlcmd -S "(local)"
CREATE LOGIN [REDMOND\mwrock] FROM WINDOWS WITH DEFAULT_DATABASE=[master]
GO
sp_addsrvrolemember 'REDMOND\mwrock', 'sysadmin'
GO
EXIT

Start-Service mssqlserver

This stops SQL Server, then starts it in single-user mode. You will then need to open a second shell to run the login creation script. Once the user is created in the correct role, go back to the shell running the SQL instance and exit via Ctrl-C. Finally, start SQL Server normally and you are good to go.