Sunday 13 June 2010

Suddenly, It's all about the Revision Control

Nerd Time is NOW, so normal folk should go grab some beers and get drunk, or alternatively read the post and play "Guess-When-Zinc-forgot-the-point-he-was-going-to-make".

Aaannyway, this week I have been playing around with a Revision Control System on my laptop, a pleasing exercise in avoiding writing actual code.

When you're part of a team writing software, it's nice if everybody can see the code that is being worked on. In Olden Times (1992) at my very first job with computers, we just whacked all the source code into a shared directory on the network so everyone could see it, and that was that.

This led to problems.

The biggest set of problems it causes is to do with Concurrency, or what happens when too many cooks are working on the same pan of broth. One cook decides that the broth needs some salt and goes to get the fancy salt cellar from the bread bin (which is a perfectly sensible place to keep salt, I don't care what you think). While he's away at the other end of the kitchen, one of the other cooks also decides that the broth needs salt and wanders off to find wherever it is that his idiot co-workers (none of whom ever talk to each other) have hidden it. In the meantime, first cook comes back and adds salt. Delicious! However, second cook is still out there looking for the salt and eventually (after he either finds the salt or goes out to buy more) comes back and adds some more. Now it is too salty and all this sodium makes the customer's heart sad :(

Actually that was a terrible analogy but I'm leaving it in because I like soup. In programming terms, what can happen is that both programmers start out with a copy of the same file (version 1) and make their changes. Programmer A finishes his changes and saves the file, which is now Version 1.1A. Eventually, programmer B finishes his changes, but is unaware that Programmer A has changed the file already, so when he saves his changes, he overwrites Programmer A's changes and replaces Version 1.1A with his own version 1.1B.
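For the nerds still reading, the overwrite described above can be sketched in a few lines of Python. This is only an illustration: the `shared` dictionary is a made-up stand-in for the shared network directory, and the file name `soup.c` is invented.

```python
# Hypothetical stand-in for the shared directory: file name -> contents.
shared = {"soup.c": "version 1"}

# Both programmers start out with a copy of the same file.
copy_a = shared["soup.c"]
copy_b = shared["soup.c"]

# Each makes a change to their own copy.
copy_a += " + A's change"
copy_b += " + B's change"

# A saves first, then B saves over the top without knowing A was there.
shared["soup.c"] = copy_a
shared["soup.c"] = copy_b

print(shared["soup.c"])  # version 1 + B's change
```

Nothing went wrong as far as the computer is concerned, which is exactly the problem: A's work has vanished and nobody was told.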

You could fix this by implementing a file "locking" mechanism so that when programmer A starts making his changes, he locks the file under edit so that no-one else can touch it (or at least get lots of whiny prompts about the file being locked and read-only when they try). This still leaves you with a couple of problems, one new and one old.
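A locking scheme along those lines might look roughly like this. It is a toy sketch, not any real tool's API: `locks`, `lock`, `unlock` and `LockedError` are all names invented for the example.

```python
class LockedError(Exception):
    """Raised when somebody else already holds the lock on a file."""


# Hypothetical lock table: file name -> name of the programmer holding it.
locks = {}


def lock(name, who):
    holder = locks.get(name)
    if holder is not None and holder != who:
        # This is the whiny prompt.
        raise LockedError(f"{name} is locked by {holder}")
    locks[name] = who


def unlock(name, who):
    # Only the holder can release the lock.
    if locks.get(name) == who:
        del locks[name]


lock("soup.c", "A")

blocked_message = None
try:
    lock("soup.c", "B")  # B has to wait
except LockedError as err:
    blocked_message = str(err)

unlock("soup.c", "A")
lock("soup.c", "B")  # now B can get on with it
```

B's change can no longer clobber A's, but B has spent the whole time A held the lock doing nothing useful.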

The first problem with this is that it can lead to a situation where half your programmers are sat around being forced to search the internet for nude pictures of Karen Gillan because they all want to make changes to the same file. It rather makes having a central copy of the source a waste of time if it limits the number of programmers who can use it at any given time. The next thing you know, everyone's making local copies of the source "to work on" and things get very bad very quickly.

The second problem is what happens when, say, two programmers are making changes to separate files, File A and File 2...  er, File B. Great, no problem! Except that the changes in the new version of File A are dependent on File B, and specifically they were dependent on the old version of File B, which just got changed. No-one's changes have been lost, but the central copy of the source is now in an uncompilable state if you're lucky (at least that way tracing the problem is relatively straightforward) or, worse still, crashes at runtime and you're going to have to do a boatload of debugging to find out what's going on.

How do you solve these problems? Well, hopefully you install a Revision Control System like Subversion. It won't exactly prevent these problems, but it helps you fix them when they show up.

How's it manage that then? Well, basically files stored in a Revision Control System are stored as a set of changes to the repository. Create an empty repository, that's revision 1, add a file to the repository, that's revision 2, make some edits to the file for revision 3 and add another file for revision 4. Note that file versions differ from repository revisions; version 1.0 and 1.1 of a file might be revisions 56 and 1441 respectively if a lot of other files were changed between the two versions.
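As a rough model of that numbering, here is a toy repository in Python. It is an assumption-laden sketch and not how Subversion actually stores things: the `Repository` class, its method names and the file names are all made up, and each "change" is just the whole new contents of the files it touches.

```python
class Repository:
    """Toy repository: a list of changes, one per revision."""

    def __init__(self):
        # Revision 1 is the empty repository: nothing changed yet.
        self.changes = [{}]

    def head(self):
        """Number of the latest revision."""
        return len(self.changes)

    def commit(self, changed_files):
        """Record a new revision; changed_files maps file name -> contents."""
        self.changes.append(dict(changed_files))
        return self.head()

    def checkout(self, revision=None):
        """Rebuild the files as they were at a revision (default: head)."""
        revision = self.head() if revision is None else revision
        files = {}
        for change in self.changes[:revision]:
            files.update(change)
        return files

    def revisions_touching(self, name):
        """Repository revisions in which a given file was changed."""
        return [n for n, change in enumerate(self.changes, start=1)
                if name in change]


repo = Repository()                        # revision 1: empty
r2 = repo.commit({"soup.c": "v1.0"})       # revision 2: add a file
r3 = repo.commit({"soup.c": "v1.1"})       # revision 3: edit it
r4 = repo.commit({"salt.c": "v1.0"})       # revision 4: add another file

print(repo.revisions_touching("soup.c"))   # [2, 3]
print(repo.checkout(2))                    # {'soup.c': 'v1.0'}
```

The file's own versions (1.0, 1.1) and the repository's revisions (2, 3) are separate counters, and the gap between them grows as other files get changed in between.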

Users don't work directly on the repository; they work on local copies of the files stored in the repository, copied by the Revision Control System at a given revision level, which is usually the "latest" or head revision at the time the local copy is requested or "checked out". The programmer can then edit their local copy and then submit their changes back to the repository as a new revision (a given revision can potentially affect multiple files in the repository). This gets rid of the first problem with locking as everyone can work quite freely on their local copy.

So what about the concurrency issues that locking was supposed to stop? Well, with a Revision Control System, the magic happens when the user submits their changes. The system looks at what is being changed and checks if there have been any other changes to those files since the user took their local copy. If there have, the user has to go through a merge process where they inspect both sets of changes and combine them into one good working set. This may be as simple as spotting that neither set of changes conflicts and adding them both to the same file automatically, or it may require rather more complex editing. Once you're done, you can check in your merged changes.
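The check at submission time can be sketched like this. Again it is a toy with invented names (`Repo`, `submit`, `OutOfDate`), not any real system's implementation, and it stops at the point of refusing the submission rather than attempting the merge itself.

```python
class OutOfDate(Exception):
    """Raised when a submitted file has changed since it was checked out."""


class Repo:
    def __init__(self):
        self.head = 1            # revision 1: the empty repository
        self.files = {}          # file name -> contents at head
        self.last_changed = {}   # file name -> revision that last changed it

    def checkout(self):
        """Hand out a local copy along with the revision it was taken at."""
        return self.head, dict(self.files)

    def submit(self, base_revision, changes):
        """Accept changes only if none of the files moved on since base_revision."""
        stale = [name for name in changes
                 if self.last_changed.get(name, 0) > base_revision]
        if stale:
            raise OutOfDate("merge needed for: " + ", ".join(sorted(stale)))
        self.head += 1
        for name, contents in changes.items():
            self.files[name] = contents
            self.last_changed[name] = self.head
        return self.head


repo = Repo()
repo.submit(1, {"soup.c": "version 1"})      # becomes revision 2

base_a, copy_a = repo.checkout()             # both take a copy at revision 2
base_b, copy_b = repo.checkout()

repo.submit(base_a, {"soup.c": copy_a["soup.c"] + " + A"})   # revision 3

rejected = False
try:
    repo.submit(base_b, {"soup.c": copy_b["soup.c"] + " + B"})
except OutOfDate:
    rejected = True                          # B must merge before resubmitting

print(rejected, repo.files["soup.c"])        # True version 1 + A
```

Compared with the shared directory, the difference is that B gets told about A's change instead of silently flattening it.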

In the case of changes to two separate files whose interdependency is only going to show up at compile time, chances are that the Revision Control System won't spot any potential problem itself, but will give the developers the tools to spot potential problems and to deal with them much more effectively when they do arise. The Revision Control System can show you if your local copy is still out-of-step with the actual repository after you've made a submission, indicating that the potential exists for a dependency conflict. A reasonable developer will then check out the latest head revision of the code and do a quick build to make sure everything's all right.

A better developer might do an update-and-build smoketest before they submit their changes to the repository in an attempt to not check in code that won't compile. I say that as a fan of Continuous Integration, who tried to instil in my colleagues a terrible fear of breaking the nightly build.

If something does slip through, the Revision Control System offers help. You can browse through all the submissions since the last good revision to see if any of them look problematic, which can help speed the fix along. You can also check out earlier revisions than the bad one if you need to keep working on something else while other developers work on fixing the problem.

All in all, Revision Control Systems are an absolutely vital part of modern software development, to the point that I use one on projects where I'm the only person working on them, as the ability to track development history and easily manage branches and releases is just as vital as preventing concurrency issues on a shared codebase. The funny thing is that even now, new programmers are managing to come straight out of college/university with no experience in these vital tools, or much of any experience in working as part of a development team. It's a conflict between the need to train good developers who are going to have to be good team-workers a lot of the time, and the need for the college to be able to individually assess the progress and ability of their students.
