Best kept secrets of peer code review: a summary

I read the book Best Kept Secrets Of Peer Code Review, so I thought I’d write about it.

Warning: the people who wrote this book sell code review software, so unsurprisingly they suggest that doing code review is worth the effort if you are using some code review software to make things easier. Also, the “waterfall” is everywhere; coding is an activity that is done after designing and before testing.

The book attempts to be an evidence-based review of code review. It looks at published literature and includes some new research carried out on code reviews that were done with their own software. To my eyes, the evidence mostly doesn’t give any clear conclusions on how effective code review is likely to be. The main suggestion, unsurprisingly, is to do some reviews and gather lots of data on how many bugs were found for the hours taken to do the reviews. No shock there. There are a few very good things, though, including a review of eye movements in code reviewers.

The most interesting stuff is the concrete measurements of how much code can be reviewed effectively. This is given by two numbers:

  • How fast you can look through code: their measurements suggest that on Java code you should be aiming for no more than 400 lines per hour, and less if the module being inspected is complex or critical.

  • How much time the review should take: they suggest that you should spend no more than 60 minutes reviewing a code set; after that time a person becomes “saturated” and can’t see any more bugs no matter how long they spend looking. (A code set could be a class, a function, a module, or a group of related check-ins across many files.)

They also discuss expected “defect densities”. That is, the number of defects per thousand (k) Lines Of Code (kLOC).

  • Of course, the question is: what is a defect? It can be anything that the reviewers think should be changed in the code, and it obviously varies from the trivial to the critical. There are no hard rules about what a defect is; that will depend on your system, but they include such things as not checking errors, not validating inputs, not checking for nulls, “off by one” iteration, inconsistent comments, comments with spelling mistakes, etc.
  • The harder and longer you look, the more “defects” you find. There is no real upper limit for the number of changes that could be requested after a very intensive review by multiple reviewers. Obviously at some point, the changes begin to be matters of taste rather than absolute “this is good engineering”.
  • If you really, really want to know some hard numbers, defect rates range between:
    • 5 defects per kLOC for very mature stable code with tight controls on the changes
    • 100 defects per kLOC for new code written by “junior” developers or in a loose development environment
    • And bear in mind that this isn’t “significant” lines of code, this includes comments and this is – in theory – working code.

Also,

  • Critical code should be reviewed by more than one person to get some discussion going on best practices
  • The book emphasized code review and didn’t really look into “design” review. The examples seemed to be more about reviewing check-ins to a stable code base, rather than large commits of new code, so there was some discussion of doing a “guided” review where the author writes a document that lists the changes to various files and attempts to guide the review through the changes. The risk here, of course, is that the reviewer sees what they expect to see, not what is actually there. The advantage is that in preparing for the review, the author checks more things and will find their own errors.
  • They also suggest that author preparation is correlated with low defect rates, and that when people know their code may be reviewed it has fewer defects: there is a social effect that makes people raise their game so they don’t keep making code with the same defects.

So, combining these things they recommend that a review should ideally be around 200 lines and last no longer than an hour.
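To make those numbers concrete, here is a small back-of-the-envelope sketch (mine, not the book’s) that turns the 400-lines-per-hour ceiling, the 60-minute session limit and the 5–100 defects/kLOC range into a rough review plan; the 1,200-line change set is just an invented example:

using System;

class ReviewEstimate
{
    // Figures quoted in the book (summarised above):
    const double MaxLinesPerHour = 400;     // review no faster than this
    const double MaxMinutesPerSession = 60; // reviewers "saturate" after an hour
    const double DefectsPerKlocLow = 5;     // mature, tightly controlled code
    const double DefectsPerKlocHigh = 100;  // new code, loose environment

    static void Main()
    {
        int linesToReview = 1200; // hypothetical change set

        double hoursNeeded = linesToReview / MaxLinesPerHour;
        double sessions = Math.Ceiling(hoursNeeded * 60 / MaxMinutesPerSession);
        double kloc = linesToReview / 1000.0;

        Console.WriteLine("Plan for at least {0} one-hour review sessions", sessions);
        Console.WriteLine("Expect roughly {0:F0} to {1:F0} defects to be raised",
            kloc * DefectsPerKlocLow, kloc * DefectsPerKlocHigh);
    }
}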

I’ve made my own code review checklist that I think embodies these things.


Advice to a young agile developer written by an old one

On reading Franklin’s essay Advice to a young tradesman I thought that it had some specific lessons for agile developers. Of course, it is a total piece of genius that is centuries ahead of its time, and I’m adding nothing to it. Particularly in the writing: Franklin’s prose is clean and humorous with not a word wasted. Even if you disagree with him – and I don’t see how you could – you have to love the writing. So with apologies (and I’m not really old…):

Time is money

Every hour spent working on something that is not directly related to a feature request is time wasted, and time wasted is money wasted. Every feature that is made but isn’t needed also has a cost, measured in features that are needed but don’t get made. And if they do get made, the work is rushed.

Quality is money

If work is rushed, or for other reasons has low quality, then you will spend more time – and therefore more money – repairing it and making good. Of course, since quality is itself a feature, bear in mind point #1 and only deliver what is needed.

Trust is money

When you deliver something and it doesn’t work, your agile methods of rapid development will be doubted. You’ll be forced to spend more time proving that quality is high, and so you’ll have less time to develop new features, and as we know, time is money. If you always deliver features on time and with known quality, then you and your methods will be trusted. Users will engage with the process more, they will be more interested in the systems you produce and will be able to produce more focused requests and more accurate bug reports.

Having trouble automating your tests? Then don’t!

We all love automated tests, right? Unit tests, functional tests, all kinds of test are better when they are automated. Unless of course, they aren’t.

The general pattern of testing software is configure-execute-assert. That is, we attempt to set the software into some known state: that includes constructing particular objects (including parameters for methods), setting configuration and even building test databases and inserting known data. Then we poke the appropriate part of the system by calling a function or injecting a message or event. Then we write some more code that accesses the state of the system again and checks it against some expected value. That expected value might be dynamic (like today’s date) or static (the method should return 0). The simplest code to test can skip step 1, as there is no stored state (barring simple parameter inputs to the method), and the assert step is as simple as checking what the method returns.
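As a minimal illustration of the pattern, here is a sketch of a unit test in NUnit style; the BankAccount class is invented for the example:

using NUnit.Framework;

[TestFixture]
public class BankAccountTests
{
    [Test]
    public void Withdraw_ReducesBalance()
    {
        // configure: put the system into a known state
        var account = new BankAccount(100m);

        // execute: poke the part of the system under test
        account.Withdraw(30m);

        // assert: compare the resulting state with the expected value
        Assert.AreEqual(70m, account.Balance);
    }
}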

The kinds of test that won’t be squeezed to fit this pattern are the ones that involve users. You might very easily have a requirement that “the users of the website can find what they need easily”. This is an important requirement to be sure, but not a testable one. What do you do, ignore it? There are often less vague requirements, like “the report should show the individual transactions on the front page when there are less than 10 of them, but they should be on another page when there are more than 10”. That is a very reasonable – and very common – requirement for reporting applications. Unless you have hand-crafted your own reporting framework, it is also impossible to make an automated test for it. The *stuff* that will make that report will be a mix of data, configuration in binary files and code, and all of it will be hidden in the reporting system framework; the only input will be the data from the database, the only output will be the PDF report. Creating a test that relies on doing a diff of the actual PDF file against the expected PDF file is not really an option; unless the report is very stable, the test will be brittle and will be constantly breaking as other parts of the report are updated.

So how do you automate? You don’t. At least, not fully.

You have to verify the requirement has been met and expecting the end-user to do all the work by clicking buttons is unreasonable. Particularly if you have to regression test. So what to do? My suggestion is that you only use people for the “assert” part and close up the rest with automation (See Scrum and XP from the trenches page 70).

So you make a test system that does the following for each test case (a rough sketch in code follows the list):

  1. The configuration part. In this example, you have database entries that hold 2 sets of report data, one with more than 10 transactions and one with fewer; the data is stored in the test case so the user doesn’t need to remember it or input it.
  2. The execute part; the test system runs both reports
  3. The first part of assert is capturing the state; the system presents both reports to the tester in one screen
  4. The test case has a full description of what is expected, maybe including a diagram that shows what the tester is to check on the reports. Maybe some sample reports linked in.
  5. Some buttons to say “pass” and “fail” (and maybe “test broken”) that the user can click to store the result and move to the next test (N.B.: it would be good to store timing data too; it would be nice to know if some tests are ultra-time-consuming)
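Here is the rough sketch of that loop. TestCase, ReportRunner, ReviewScreen and ResultStore are hypothetical types standing in for whatever your reporting stack and test-case store actually provide; the shape of the loop is the point, not the names:

foreach (TestCase testCase in testCases)
{
    // 1-2. configure and execute: load the stored data sets and run both reports
    var smallReport = ReportRunner.Run(testCase.SmallDataSet); // fewer than 10 transactions
    var largeReport = ReportRunner.Run(testCase.LargeDataSet); // more than 10 transactions

    // 3-4. capture the state: show both reports plus the description of what to check
    var screen = new ReviewScreen(smallReport, largeReport, testCase.Description);

    // 5. the human does the assert; record the verdict and how long it took
    var timer = System.Diagnostics.Stopwatch.StartNew();
    var verdict = screen.WaitForPassFailOrBroken();
    ResultStore.Save(testCase.Id, verdict, timer.Elapsed);
}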

This means that the human testers are not wasting their time doing things the system can do, and you don’t waste your time trying to program things that a computer can’t do. One example of this human computation, from a talk from Google, is image recognition, which simply can’t be done at present by computer but which humans can do very fast. This kind of Google approach to mass-collaboration also gives us a way around systems with hundreds of test cases. One person would go mad if they had to do them all, but if you provided a nice way for dozens of people to all take a few test cases each, then no one would get too demotivated or lose concentration and the cycle-time could be kept short.

Of course, minimizing human testing is important as even the smoothest human computation is hundreds of times slower than true automation. Also, these tests would generally be system integration tests and whilst they are more important than unit tests, they won’t be helpful for catching all the bugs. So it is best to save the human-driven regression tests for the big releases and the parts that other testing approaches can’t reach.

Double D problems

I’ve been trying to come up with a cool name for problems where the devil is in the detail; the best I’ve come up with so far is “double D” but that doesn’t really do it justice. Perhaps D^2 would be better. Anyway, you know the kind of thing I’m talking about: the problem seems simple but it just gets harder and harder the more work you do. Instead of the solution covering more of the problem, it seems to get bigger and bigger without ever getting the 100% coverage that you really need. First you find the edge cases where things don’t work; then, if you are lucky, you find the corner cases where two edge cases meet. If you aren’t lucky you find them only when someone tries to do something and it all crashes horribly: data lost, reputation ruined.

My current example, the one I’m bashing my head against, is a testing framework. And right there you have the problem. It’s a framework. I didn’t want to create a framework, but I thought it would be OK as it was only one interface. So I wrote a set of tools to generate test cases from the interface. The test cases would be in an XML file so more tests could be added by hand (or by Excel sheet, in the manner of FitNesse). Fine. Good. Except that it wasn’t. The simple cases got done very fast, just a day to knock something up. Then the slightly annoying cases (parameters that were arrays, etc). Then the really awkward ones: System.Int32&, out parameters that were datatables, and callback objects. Tricky. So I fiddled around for 3-4 days trying to get it all automated and then finally realised that the test generation code I was sweating over covered only 1% of the interface! Clearly automation wasn’t saving as much time as it was costing. So I just generated the test data once, excluded by hand the cases that didn’t work, and then checked the file into source control. Done.

In cases like this, the Pareto principle is your friend. Whilst I’m a big fan of automation in general, and it is very useful to automate the generation or consumption of documents like test cases or diagrams of database schemas, there is a point where you need to stop. So what are the options when the automation time-saving curve begins to bend?

Well, my favourite options are:
1) exclude it: you should notify loud and early when you get to a case that isn’t automated, just so you remember it (there’s a small sketch of this below)
2) hardcode it: be honest, how often is that thing going to change? If changing the hardcoded value is quicker than rewriting the code to handle all the cases, then hardcode it. Of course, remember that other classic of time-management, the rule of 3: if you do the same thing 3 times, find another way.
3) special-case sister system: if you really have to automate the special cases, then create a different system which is designed entirely around those special cases. In my case the “one Excel row = one call to a method by reflection” approach wasn’t going to work for methods that took callback objects to subscribe to events, but it was easy to create a test client that only did callback functions. Then you can get back to your high-productivity 20%-effort-for-80%-results programming.
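For option 1, the only trick is to make the exclusion impossible to miss. A minimal sketch, assuming a reflection-based generator like the one described above; CanAutomate and GenerateTestCase are hypothetical stand-ins for your own generation code:

// Walk the interface and generate what we can; shout about what we can't.
foreach (var method in interfaceType.GetMethods())
{
    if (!CanAutomate(method))
    {
        // excluded, but recorded loudly so it doesn't get forgotten
        Console.WriteLine("NOT AUTOMATED: {0} - needs a hand-written test", method.Name);
        continue;
    }
    testCases.Add(GenerateTestCase(method));
}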

Forcing the garbage collector (2)

OK some updates that I have to share.

1) GC modes

If you add this setting to the app.config file and you have more than one physical CPU/core, you get different GC behaviour.

<configuration>
  <runtime>
    <gcServer enabled="true" />
  </runtime>
</configuration>

This puts the GC work on the other core, which has some interesting effects on memory and CPU. It is similar to what Java VMs do, but I hadn’t really thought about it for .NET.
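As an aside, if you want to confirm at runtime which mode you actually ended up with, later versions of the framework (.NET 4.5 onwards, so newer than what this post was written against) expose it directly:

using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // Reports whether the server GC requested via gcServer was actually enabled;
        // remember it only makes a difference with more than one core.
        Console.WriteLine("Server GC: {0}", GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: {0}", GCSettings.LatencyMode);
    }
}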

[Image: Task Manager graph with gcServer enabled, 300,000 objects]

You can see the CPU spikes as well as the different memory graph. Unsurprisingly this changes the throughput as well, and you don’t get the “bad polls” nearly so much. Overall, throughput is improved as well as latency.

2) garbage collection spy class

Add this class:

    class GCNotifier
    {
        // When a collection runs and this object is unreachable, the finalizer
        // thread calls this finalizer; it prints a marker and allocates a fresh
        // GCNotifier, so there is always exactly one instance waiting to be
        // finalized by the next collection.
        ~GCNotifier()
        {
            Console.WriteLine("**GC**");
            new GCNotifier();
        }
    }

and initialise it somewhere in main


static void Main(string[] args)
{
	new GCNotifier();
	. . .
}

then it will be touched each time a GC runs and you will see a console message, although not necessarily at the same time, as the GC and the finalizer run on different threads. It is also a really cunning piece of code (not written by me).
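If you would rather not depend on finalizer timing at all, a simpler (if less cunning) alternative is to poll the collection counters from your monitoring thread; GC.CollectionCount has been in the framework since .NET 2.0. A sketch:

// Poll the per-generation counters once a second and report full collections.
int lastGen2 = 0;
while (true)
{
    int gen0 = GC.CollectionCount(0);
    int gen2 = GC.CollectionCount(2);
    if (gen2 > lastGen2)
    {
        Console.WriteLine("gen 2 collections so far: {0} (gen 0: {1})", gen2, gen0);
    }
    lastGen2 = gen2;
    System.Threading.Thread.Sleep(1000);
}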

Here is an excellent post on GC “secrets” from a blog I’d not seen before, but which is now on my list:

http://blogs.msdn.com/tess/archive/2007/04/10/net-garbage-collector-popquiz-followup.aspx

Forcing the garbage collector

I’m currently working on a system that is being designed to minimize problems with garbage collection. In the past I would have said that the garbage collection system is not the concern of application developers. I stand by that for most cases but this is a somewhat unusual application in that it combines a large number of operations per second with a very large number of objects in play (the “working set”).

I have written a small program to expose similar behaviour and it is below in full (short and complete). The basic operation is:
* create a large array of objects which hold some data
* select an object at random from the array and replace it with a new object; repeat this as fast as possible
* monitor progress on another, very light, thread that polls every second or so and reports the number of operations completed since the last poll.

The idea is that the large number of new objects makes for a very high memory-allocation velocity, which means that there is a lot of garbage to clean up. The large working set means that the garbage collector has a lot of objects to traverse, and the random nature of the deletions means that the memory becomes fragmented and will require extensive compaction in order to create contiguous blocks of memory for allocating new objects.

Each of the objects in the array is about 4kB (= 10*100*sizeof(int)) and I started to observe the effects of garbage collection at about 20,000 objects on my dual-core desktop machine.

To run the program:

  1. save it (as say, c:\temp\Garbage.cs)
  2. open the Visual studio command prompt
  3. change to the c:\temp folder
  4. compile it by running csc c:\temp\garbage.cs
  5. run the program, specifying the number of objects in the working set as an argument (in this case 1000): Garbage.exe 1000

First, let’s look what happens when the working set is small. If I run with 100 objects I see something like this:

C:\Temp>garbage.exe 100
starting main..
Filling the working set with 100 objects
starting monitor thread..
starting worker thread..
press enter to kill it
the count this iteration is:8,         ticks :0
the count this iteration is:17489,     ticks were:2127710960
the count this iteration is:17480,     ticks were:2127296384
the count this iteration is:17178,     ticks were:2127757912
the count this iteration is:17488,     ticks were:2127710808
the count this iteration is:17339,     ticks were:2127596432
the count this iteration is:17083,     ticks were:2132044856
the count this iteration is:17323,     ticks were:2127488824
the count this iteration is:17412,     ticks were:2127741720
the count this iteration is:17537,     ticks were:2127549544

So the program polls regularly and the number of updates to the data is recorded as the “count this iteration”; the ticks elapsed since the last poll are also recorded. If I run this for a while I get this kind of distribution of counts:

updates per iteration    number of polls

========================================

15900        2
15950        0
16000        2
16050        1
16100        0
16150        7
16200        20
16250        27
16300        42
16350        49
16400        27
16450        11
16500        1
16550        0
16600        0

average=16276
standard dev = 123
std/average = 0.8%

which looks “normal”: we don’t see any spikes in memory, and the garbage collector is keeping up with no pressure on memory or CPU. This is a typical app and we don’t need to worry about garbage collection. This behaviour actually persists — on my machine — up to about 20,000 objects (= 20,000 * 4 kB = 80 MB). Above that, things start to happen.

When I run with 100,000 on my machine, I get something like this:

C:\Temp>garbage.exe 100000
starting main..
Filling the working set with 100000 objects
starting monitor thread..
starting worker thread..
the count this iteration is:1,           ticks were:0
press enter to kill it
the count this iteration is:12498,     ticks were:2127792200
the count this iteration is:12502,     ticks were:2127633784
the count this iteration is:12666,     ticks were:2127541688
the count this iteration is:7239,      ticks were:2127698056

the count this iteration is:12174,     ticks were:2127482736
the count this iteration is:8522,      ticks were:3823133528

the count this iteration is:12649,     ticks were:2127596272
the count this iteration is:12781,     ticks were:2186759112
the count this iteration is:12856,     ticks were:2127987848
the count this iteration is:12435,     ticks were:2127509440
the count this iteration is:6981,      ticks were:3197386008

the count this iteration is:12589,     ticks were:2148094192

We notice 2 things immediately:
1) the number of operations per poll is no longer distributed around an average; there are “bad polls” when far fewer operations complete
2) during those “bad polls” the monitoring thread is affected as well. The poll actually takes 50% or so longer to complete, so these bad polls are very bad: the throughput of operations on the data has dropped by as much as 75%

The distribution of operations per poll looked like this for me (your mileage may vary):

updates per iteration    number of polls

========================================

1000        3
2000        6
3000        2
4000        4
5000        2
6000        0
7000        0
8000        1
9000        0
10000        0
11000        6
12000        133
13000        43
14000        1

average = 10975
standard dev = 2715
std/ave = 25%

The big things to notice are:
1) the operations per poll have dropped even on good polls; that seems somewhat reasonable, but the decrease is quite large
2) the distribution goes nearly down to 0; the range of operation counts went from 13681 down to just 261! This “fat tail” is the result of the “bad polls”. This is what sometimes happens to stock prices, wiping out years of modest gains with a single big loss, and it is why you shouldn’t always trust salespeople who tell you that the “average” gain on an investment is 10%; if, once every 10 years, it loses 50% of its value you won’t do so well.
3) nearly 10% of the polls were “bad”; that is, they had less than 50% of the average count
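For what it’s worth, the summary numbers (average, standard deviation, fraction of bad polls) need nothing more sophisticated than this kind of calculation over the per-poll counts; this is just a sketch fed with the handful of values printed above, not the full data set:

using System;
using System.Collections.Generic;
using System.Linq;

class PollStats
{
    static void Main()
    {
        // the "count this iteration" values captured from the console output
        var countsPerPoll = new List<int> { 12498, 12502, 12666, 7239, 12174, 8522 };

        double average = countsPerPoll.Average();
        double stdDev = Math.Sqrt(countsPerPoll.Average(c => (c - average) * (c - average)));
        double badPolls = countsPerPoll.Count(c => c < average / 2) / (double)countsPerPoll.Count;

        Console.WriteLine("average = {0:F0}, std dev = {1:F0}, std/ave = {2:P0}",
            average, stdDev, stdDev / average);
        Console.WriteLine("bad polls (less than half the average): {0:P0}", badPolls);
    }
}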

A picture is worth 1k words so let’s look at task manager:

[Image: Task Manager graph while running with 100,000 objects]

and the saw-tooth pattern of repeated garbage collections is quite clear. This time we don’t see a peak in CPU activity, as we are already running at high CPU, but I guess that the CPU has switched from doing the real work to doing the garbage collection work!

So, the question is: how do we avoid this?
The answer, mostly, is that you don’t need to, unless you are trying to do 10,000 operations per second on 10,000 objects of reasonably large size (I don’t regularly create data objects with 100 integers inside). If you do have to do that, then the best thing to focus on is object lifetime. If you need to create objects, make sure that they go out of scope quickly so that they can be collected in the generation 0 garbage collection, which is much faster. If you just need to store some data, don’t create new objects; modify the data stored in an object that you already have, so the object lifetime becomes effectively infinite and it won’t be collected at all. Finally, be careful how you pass objects around: if references to them get stuck somewhere – for instance in a collection – you might be extending lifetimes that you don’t intend to. This kind of non-reference-passing, controlled-object-lifetime code is tough to write and maintain, so don’t do it if you don’t need it. The other alternatives – specialized hardware like Azul for Java, or buying in an in-memory enterprise database product (sometimes called an Enterprise Data Fabric or EDF) – are both even more expensive.
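To make the “modify the data in place” advice concrete, here is what it might look like for the SimpleObject used in the program below. The Refresh method is something I am adding for illustration; it is not in the code that follows:

// Reuse the existing object instead of allocating a replacement, so nothing
// becomes garbage on the hot path (assumes "using System;").
class ReusableObject
{
    public int Number;
    public Guid Foo;
    public DateTime Stamp;
    public readonly int[] Data = new int[100];

    public void Refresh()
    {
        Foo = Guid.NewGuid();
        Stamp = DateTime.Now;
        for (int i = 0; i < Data.Length; i++)
        {
            Data[i] = i;
        }
    }
}

// in the worker loop, instead of:
//     _data[NextIndex(indexGenerator)] = new ComplexObject();
// you would call something like:
//     _data[NextIndex(indexGenerator)].Refresh();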

code follows:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Threading;

namespace GarbageCollectConsole
{
    class Program
    {
        /// <summary>
        /// the large working set that will be the garbage collectors
        /// main problem; as we update it, it has to work hard to keep it clean.
        /// </summary>
        private static ComplexObject[] _data;

        private static int counter = 0;
        private readonly static ReaderWriterLockSlim counterLock = new ReaderWriterLockSlim();
        private static int lastCount;

        /// <summary>
        /// better than using DateTime.Now.Ticks
        /// </summary>
        private static readonly Stopwatch clock = new Stopwatch();

        private volatile static bool stopNow = false;

        /// <summary>
        /// the entry point to the program
        /// </summary>
        /// <param name="args"></param>
        static void Main(string[] args)
        {
            Console.WriteLine("starting main..");

            int arraySize = GetArraySize(args);
            Console.WriteLine("Filling the working set with {0} objects", arraySize);
            _data = new ComplexObject[arraySize];
            InitialArrayFill();

            var mon = new Thread(MonitorCounter);
            Console.WriteLine("starting monitor thread..");
            mon.Start();

            var worker = new Thread(AllocateObjects);
            Console.WriteLine("starting worker thread..");
            worker.Start();

            Console.WriteLine("press enter to kill it");
            Console.ReadLine();
            stopNow = true;
            Console.WriteLine("press enter to exit main...");

            Console.ReadLine();
            Console.WriteLine("about to exit main...");
        }

        /// <summary>
        /// parse the command line to get the size of the working set, start with
        /// at least 10,000
        /// </summary>
        /// <param name="args"></param>
        /// <returns></returns>
        private static int GetArraySize(string[] args)
        {
            if (args.Length < 1)
            {
                return 100000;
            }
            else
            {
                return int.Parse(args[0]);
            }
        }

        /// <summary>
        /// start another thread that will put messages to the console
        /// this is so slow that it won't really affect things on the
        /// worker thread
        /// </summary>
        static void MonitorCounter()
        {
            while (!stopNow)
            {
                int localcounter;
                try
                {
                    counterLock.EnterReadLock();
                    localcounter = counter;
                }
                finally
                {
                    counterLock.ExitReadLock();
                }

                clock.Stop();
                Console.WriteLine("the count this iteration is:{0},\t\t ticks were:{1}",
                    (localcounter - lastCount),
                    clock.ElapsedTicks);
                clock.Reset();
                clock.Start();

                lastCount = localcounter;

                Thread.Sleep(1000);
            }
        }

        static void InitialArrayFill()
        {
            for (int i = 0; i < _data.Length; i++)
            {
                _data[i] = new ComplexObject();
            }
        }

        /// <summary>
        /// randomly abandon a referenced object. One of the objects
        /// that is referenced from the _data array is replaced. As the _data
        /// array is the only reference to the object, it is available for
        /// garbage collection. As this loop only hits about 10,000 per second
        /// — on my machine — most of the objects in the array generally live
        /// for a long time, thus they get into generation 2.
        /// </summary>
        static void AllocateObjects()
        {
            var indexGenerator = new Random();

            while (!stopNow)
            {
                Interlocked.Increment(ref counter); // safe to do as this is the only thread updating
                _data[NextIndex(indexGenerator)] = new ComplexObject();
            }
        }

        /// <summary>
        /// use random, as we don't want to just go through the array as it will
        /// end up being contiguous in memory again. We want the heap memory to get
        /// as fragmented as possible
        /// </summary>
        /// <param name="indexGenerator"></param>
        /// <returns></returns>
        static private int NextIndex(Random indexGenerator)
        {
            return (int)(indexGenerator.NextDouble() * _data.Length);
        }
    }

    /// <summary>
    /// a simple object, I had to include the _data array
    /// to add some memory pressure. Without it the memory allocation
    /// velocity was so small that the memory was cleared without even
    /// noticing a CPU or memory spike
    /// </summary>
    class SimpleObject
    {
        public int Number { get; set; }
        public readonly Guid Foo;
        public readonly DateTime stamp = DateTime.Now;
        public readonly int[] _data = new int[100];

        public SimpleObject()
        {
            Number = (int)(new Random()).NextDouble();
            Foo = Guid.NewGuid();
            for (int index = 0; index < _data.Length; index++)
            {
                _data[index] = index;
            }
        }
    }

    /// <summary>
    /// create a deeper object graph, may not make a big difference
    /// </summary>
    class ComplexObject
    {
        public readonly SimpleObject[] _data = new SimpleObject[10];

        public ComplexObject()
        {
            for (int index = 0; index < _data.Length; index++)
            {
                _data[index] = new SimpleObject();
            }
        }
    }
}

Love in an elevator: the right way to do a stand-up meeting

Lunchtime: people in my team got up to go and get a sandwich. Not wanting to be left behind even people who didn’t need a sandwich went out with the others to get some air. As soon as we got into the lift we automatically formed an inward facing circle — a huddle — and downloaded what we had been doing that morning. By the time we got to the lobby we had done a mini, half-day scrum!

What an excellent way to enforce standing up, standing close and timeboxing! Sadly, it doesn’t work for distributed scrum teams; you can’t even use your mobiles to conference people in, as the elevator makes such a good Faraday cage.