Udi Dahan on durable messaging

Udi Dahan has written a good MSDN article on messaging and some further comments on durable messaging.

(I’m going to reply here and link to here in the comment)


I particularly like your observations on when durable messages work against you. I work in the finance industry and as you note we often use a mix of durable and non-durable messaging solutions. For applications like price streams you may need to ship thousands of messages a second but don’t care if you lose a few, but for the submission of orders you must be certain that you don’t lose any.

There are several messaging middleware providers that target the finance industry specifically — e.g., TIBCO — that try to address the low-latency requirement. We have used non-transactional MSMQ and have got up to a few thousand messages per second; TIBCO claims to support up to 50,000 messages per second! However, they don’t specify if you need a z/OS mainframe to do that…

Your comments on very large messages were also interesting. The decision to allow multi-message orders seems to have caused really far-reaching changes to the design. I wonder if you considered solving the problem “internally” by inserting a message processor in your inbound message stream that broke up large messages – a splitter – so that you have smaller messages to deal with but you are in control of the message order and flow. If necessary you could have diverted them into another queue and maintained order that way, while allowing other messages to “overtake” in the regular queues. Of course, this is still painful as it changes the way messages flow and you still have to parse the 50MB XML. However, in the cases I have seen with similar problems, the counterparty was such a lumbering behemoth that there was no chance of them being able to refactor their solution to make any changes to the message format or the message choreography (which I think is a lovely way to say the “request-response pattern for the message conversation”).
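To make the splitter idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `order`/`line` XML layout, the `orderPart` element, the `seq`/`total` attributes and the two in-memory queues are hypothetical stand-ins, not the format or middleware from the article. Each part carries a sequence number and a total so that a consumer could reassemble the order or detect a missing part.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical in-memory stand-ins for the real queues.
regular_queue = deque()
large_order_queue = deque()


def split_order(xml_text, batch_size):
    """Split one large order message into smaller, ordered part messages."""
    root = ET.fromstring(xml_text)
    order_id = root.get("id")
    lines = root.findall("line")
    batches = [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]
    parts = []
    for seq, batch in enumerate(batches, start=1):
        # Every part names its parent order, its position and the part count.
        part = ET.Element(
            "orderPart",
            {"orderId": order_id, "seq": str(seq), "total": str(len(batches))},
        )
        part.extend(batch)
        parts.append(ET.tostring(part, encoding="unicode"))
    return parts


def route(xml_text, max_lines=2):
    """Divert oversized orders to their own queue; small ones go straight through."""
    root = ET.fromstring(xml_text)
    if len(root.findall("line")) > max_lines:
        # Parts stay in order on a dedicated queue, so messages on the
        # regular queue are free to overtake them.
        large_order_queue.extend(split_order(xml_text, batch_size=max_lines))
    else:
        regular_queue.append(xml_text)
```

Note that the sketch still parses the whole inbound document before splitting it, which is exactly the cost mentioned above; a streaming parser would be the next refinement.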

That is the power and the pain of messaging: it provides a clean interface – the message format – to work with counterparties, but if the message conversation starts to change then the asynchronous nature of messaging can make the changes pop up everywhere in the message chain.

But, great article to introduce a really interesting subject.


Building stable applications

Let’s be clear before we start: this is about systems that are

  1. “Business software”: the conclusions here are mostly concerning the special kind of complexity that is faced in business problems, but isn’t faced in, say, compilers or graphics engines or games; they have different types of complex problem
  2. “Enterprise software”: Software that is what Martin Fowler describes as “interesting”. That is, it is software that connects to other bits of software and tries to do something that has some relevance in the real world.

It is a classic truism that developers are very good at solving the wrong problem. A user presents with a problem – call it x1 – and says that maybe they will have the problem x2 in the future. The developer listens and very deliberately goes off and solves the set of problems x* which contains x1, x2, x3 and any other related problems. They also try to generalise their program so that when problems y1, y2, y3... present themselves they can just change some config and have that licked too. Of course, if we just add one more layer of abstraction, one more interface, one more pattern then we can generalise it to do anything...

I’ve made this sound ridiculous but in some cases it can actually work. It depends on the developer doing two things

  1. correctly interpreting the problems presented by the user
  2. casting the problem into a suitable programming problem.

1. Interpreting the problem
This can be seriously complicated by the user trying to express the problem in “computer language”. We all know that you shouldn’t give people what they ask for, but what they need. If the developer is very skilled – and it seems to be a mix of experience and talent – then they can solve the underlying problem that the user has, even when the user presents a set of symptoms that seem unrelated. If we can see past the symptoms and diagnose the underlying problem we can sometimes solve many problems at a stroke. Even better than that, it can stop the kind of low-level chatter of bugs that drives a developer nuts. The user is constantly raising bugs about “the system” intermittently failing and their machine needing reboots. There are no intermittent problems, only intermittent symptoms, and maybe some of those bugs are all linked to a common cause. If only you could see through the veil.

This is a wonderful feeling when you can do it for someone. What this really requires is not a requirements capture process and a business analyst or a focus group. It requires talking to people. This is very obvious in a big company where people can’t talk to each other as the support team is in Boston and the user group is in Sydney and the desktop support people who get the call are in London. If you can ever get the right person on the phone, you can fix the problem in just one minute.

2. Solving the right programming problem
Real-world programming is not about solving the problem that someone gives to you.

My daughter has a shape-sorter.

That is a problem that she can solve by herself. However, it is not a real problem. If it were a real problem it would be possible to jam some of the shapes through the wrong holes by twisting them around or taking advantage of the materials that the thing was made out of and bending the holes or the shapes. But this is a problem that has been made to be solved. It has been made by people trying to make a problem, not by people trying to make a solution. So the problem is engaging and tricky but not impossible and there is exactly one solution.

In the computer science classroom you must solve the binary-sort problem as you are given it. In the real world the best system developers don’t solve hard problems, they work around them. The skill is in casting the problem into a simple form and drawing the boundaries around systems so that they can present consistent, stable and self-contained interfaces to the world. And, of course, unlike the shape sorter you should recognise when there are exactly zero solutions to the problem, then go and solve a different, related but still useful problem.

A stable problem gives a quality solution… eventually
Some programmers have a knack for turning real-world problems into programming problems that have neat solutions. Part of that neatness is a problem that doesn’t change every five minutes. It can be coded once and coded right and it seems that life is very simple for these people!

All it means is that a person knows how to look at a whole mess of concepts and data and process and can pull it into some smaller chunks. The important thing about those problem chunks is that they are stable in some sense so they can be solved by a system. The problem chunks need to be:

  • internally cohesive so all the stuff that is together belongs together; then the system is conceptually unified, so all of the features are related
  • well separated from other chunks so they only interact along the chosen interfaces; this means the systems are conceptually normalised, there is little or no overlap in function between systems

The person who is excessively good at doing this may not even know that they are doing it; just as a person with a good sense of direction knows where they are. It just seems to make sense to them to break up the user tasks in that way, in a way that provides a nice edge or interface to the system. I’m not talking about the actual interfaces of the object-oriented language, but system boundaries.

For instance, a relational database has a very nice system boundary. It contains literally any type of data that can be serialised into a stream of bytes – and humans have got that down, the only things we haven’t reliably serialised are smells – and it can organise that data into “lists”, then search and retrieve that data. Simple.

Early spreadsheets like VisiCalc used to have a good boundary. Anything involving tables of numbers, it did; anything else, it did not. And VisiCalc was programmed by one guy in about 8 months. Then things like Lotus 1-2-3 came along and the lines started to get blurred. Graphics, charting, database, but still a coherent system based around tabular data (and the first versions of Lotus 1-2-3 were written in a year or two by a single small team).

And then, you get to recent versions of Excel which is, in my opinion, everything you would want from an application development platform (except type safety, of course 😉 ) as well as being a phenomenal spreadsheet, database, graphics program, etc, etc. However, Excel has more engineering hours in it than the space shuttle; and the space shuttle didn’t have to have marketing focus groups on where the buttons would be and what the default font would look like. Solving all those problems together was hard. It has taken Microsoft more than 20 years and probably 20,000 man years of effort; let’s think about that number for a second: 20,000 years of effort. Of course, they have solved an unstable problem (in fact, many unstable problems) that is prone to many small changes as features are added; but does anyone want them all? Well, 200,000,000 users can’t be wrong, but maybe another solution that contained 20% of the features at 20% of the cost would have captured 80% of the market.

The stuff that is being done at Google docs (where I am writing this) or 37Signals has been done like this. Find a group of user tasks that goes together, solve them together and then stop. If you play with it for 30 minutes you see that it is all very slick; and self-symbiotic (I just made that term up). Every feature complements another feature. It is complete, not because there is nothing to add, but because there is nothing you can take away. That kind of application is very stable; the cloud of functionality that is Excel can never be stable, and without huge effort that instability will really hurt the quality of the product.

What is interesting is that if you can cast the problem into a stable, well-bounded problem then you can attack it iteratively as the stability means that domain experts and application users can get a feel for what the application is doing; they have a good mental model of the problem that accurately maps to the system and they can still navigate the application even when there are changes. What is even more interesting is that if the problem is stable then you don’t need to attack it iteratively. You can go waterfall or spiral or whatever you want because when it is solved, it is solved. Ok, in five/fifty years time you might want to slap a web/telepathic interface on it, but your core system won’t need to change. The system is durable because the problem is durable.

My favourite example of this is double-entry accounting that I experienced first-hand. My company is a very small financial company and we don’t mind having multiple releases – sometimes multiple releases per week – that increase functionality; but, in general, people are against refactoring because it means that you got something “wrong”. I couldn’t understand this for a long while; what is wrong with refactoring if you don’t mind the multiple releases? And what is more, they seemed to have got by without refactoring perfectly well and in most cases the systems were durable enough to survive for years.

In one particular case, the system had been almost untouched for nearly 10 years. By any metric, to survive 10 years in production is pretty impressive, and I couldn’t understand how that had happened without any refactoring. The developers claimed that it was all down to “thinking really hard”; the implication being that people who refactor are stupid. It took me a while to realise that the stability was, in part, down to solving the right problem. The system that lasted 10 years was the double-entry accounting system (database and application tier, not the reports, they change every other minute!) and that is something that hasn’t changed a great deal in decades. Of course, compliance like SOX and best practices for public accounting have changed but the fundamentals of double-entry are very, very old. Now, the system didn’t do much – it just kept a list of the balances in the different accounts – but it was sufficiently generalised to cope with any of the new situations and sufficiently specific to be useful just as it was. One of the nice things about double-entry is that you can represent any kind of asset, even types that don’t exist when you create the system. New types of asset, new types of income, new types of anything that can be written down as an amount of money can be stored as data, and that doesn’t require changes to the application or the database schema.
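A minimal sketch of why double-entry copes with new asset types as plain data, written in Python. This is not the system described above; the `Ledger` class, the account names and the sign convention (debits positive, credits negative) are all assumptions chosen for illustration. The only rule the code enforces is the old one: every transaction must balance.

```python
from collections import defaultdict
from decimal import Decimal


class Ledger:
    """Hypothetical double-entry ledger: accounts are just strings (data)."""

    def __init__(self):
        self.transactions = []

    def post(self, description, postings):
        """Record a transaction given as a list of (account, amount) pairs.

        Assumed convention: debits are positive, credits are negative,
        so a valid transaction sums to zero.
        """
        total = sum((amount for _, amount in postings), Decimal("0"))
        if total != 0:
            raise ValueError(f"unbalanced transaction: {description!r}")
        self.transactions.append((description, list(postings)))

    def balances(self):
        """Return the running balance of every account seen so far."""
        result = defaultdict(lambda: Decimal("0"))
        for _, postings in self.transactions:
            for account, amount in postings:
                result[account] += amount
        return dict(result)


ledger = Ledger()
ledger.post("buy bond", [
    ("assets:bonds", Decimal("100")),
    ("assets:cash", Decimal("-100")),
])
# A kind of asset nobody planned for is just a new account name;
# no class, table or schema change is needed.
ledger.post("buy carbon credits", [
    ("assets:carbon-credits", Decimal("40")),
    ("assets:cash", Decimal("-40")),
])
```

The design choice that matters is that the set of accounts is open-ended data while the balancing rule is fixed code, which is one way to read “sufficiently generalised but sufficiently specific”.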

Of course, the code was also pretty neat, but if the problem is neat the code can be neat as there are no special cases. And neat code is good because it is easy to test and easy to review, and that means that the implementation quality can be very high. As you don’t have messy code you can concentrate on things that are outside the domain of user-visible features, like using reliable messaging, distributed transactions, or driving up performance by using multithreading or even assembly language. As the problem isn’t changing you can concentrate on driving up the quality to the point where quality is a feature.

A stable problem allows you to create a system with a stable design, and that stable design allows you to concentrate on making an application that has no hacks.

Smarts alone are not enough

Working in a company that is obsessed by hiring people on smarts alone I can attest to the fact that being smart isn’t always helpful. The use of Google-style puzzle questions to select candidates rather than “boring” questions about “what does this piece of code do?” drives me nuts.

I overheard someone here saying “All the new graduates we interview seem to be better at the puzzles than the more experienced hires, I wonder what that means?”. The answer is: it means they are better at puzzles, nothing more. I think the important thing is that software engineering is not like pure mathematics. Your achievements are not 100% correlated with your IQ.

The point is that there is such a large body of knowledge – admittedly not as developed as the BOK for structural engineering – that it is not possible to work out the best way of doing things from first principles. Just because a person has a 140 IQ it doesn’t mean that they will figure out how to make an enterprise strength messaging system – or a secure cryptosystem – because it takes tens of man-years to do such a thing.

Tinkering around with little programs doesn’t teach some of the things that an experienced programmer on large (i.e., more than 100k lines) systems takes for granted. Having spent some time trying to tell very clever people (i.e., PhD from Cambridge, self-taught programmer) that they should do things in a certain way because that is the best practice, I know that they aren’t always responsive. They think that they can manage the complexity because they are clever, and immature enough to be in love with their own cleverness.

Of course, there is a point where the system outgrows them and having no structure means that point is meltdown for maintainability.

Experience actually does mean something. Raw knowledge of best practice obtained from books means slightly less on its own. But the two together are really useful. Now, if you can combine them with a person with enough IQ to do the job, and enough maturity to not want to show off their IQ… then you have a software engineer and not just a clever hacker.

Exceptionally fast

[Graph of the exception timing results; time scale is in seconds]

This is the result of a test I did with some very stupid code to test the speed of exceptions. We sometimes hear that exceptions are slow and I wondered how much the stack depth affects the result. I used very simple code that recursed if the level of recursion was less than a threshold and threw an empty ApplicationException when the threshold was reached. The test was repeated one thousand times and the results are graphed above.
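For anyone who wants to repeat the experiment, here is a rough sketch of the same shape of test. It is written in Python rather than the original .NET code, so the absolute numbers will not be comparable with the graph; `EmptyError`, `recurse_then_throw` and `time_throws` are names invented for this sketch, and the depths are kept below Python’s default recursion limit.

```python
import time


class EmptyError(Exception):
    """Stand-in for the empty ApplicationException used in the original test."""


def recurse_then_throw(level, threshold):
    # Recurse until the threshold depth is reached, then throw.
    if level < threshold:
        recurse_then_throw(level + 1, threshold)
    else:
        raise EmptyError()


def time_throws(threshold, repeats=1000):
    """Return the seconds taken to throw and catch `repeats` exceptions."""
    start = time.perf_counter()
    for _ in range(repeats):
        try:
            recurse_then_throw(0, threshold)
        except EmptyError:
            pass  # the exception unwinds `threshold` frames before landing here
    return time.perf_counter() - start


if __name__ == "__main__":
    for depth in (1, 10, 100):
        print(f"depth {depth:>3}: {time_throws(depth):.4f} s per 1000 throws")
```

The timing includes the cost of the recursive calls themselves, not just the throw, so a fairer comparison would subtract a baseline run that recurses to the same depth and returns normally.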

The verdict? Well – as you might expect – the answer is: it depends on what you mean by slow!
I think that the really slow exceptions are ones that occur as a result of a problem with COM or p/invoke or ones that are marshalled across remoting or AppDomain boundaries. I’d like to do something with recursed AppDomains as that could be really painful.
PS: apologies for the image, I am simply too lazy to figure out how to make it display properly. I just cut and pasted it from Google docs where I wrote this post. I have no idea how that is working!

Project thoughts: Every task takes a week

I recently went on a basic project management course, and while discussing estimating I had a minor insight.

It is a frequently cited problem in software development that estimating how long it will take to do a task – whether design or implementation – is hard. On the project course we discussed using ranges instead of simple estimates and using the size of the range as a measure of risk. Some people even objected to that, saying that you could only estimate what you had done before. There is a grain of truth in that but, IMHO, once you have written your first “hello, world” in a language everything is similar to a greater or lesser extent.
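As a small illustration of estimating with ranges, here is a sketch in Python. The `Estimate` class, the task names and the numbers are all made up for the example, and treating the width of the range as the risk measure is just the simplest reading of what was discussed on the course, not a formal method.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    task: str
    low_days: float   # optimistic end of the range
    high_days: float  # pessimistic end of the range

    @property
    def spread(self):
        """Width of the range, used here as a crude measure of risk."""
        return self.high_days - self.low_days


def plan_range(estimates):
    """Overall range if every task hits its low end or its high end."""
    return (
        sum(e.low_days for e in estimates),
        sum(e.high_days for e in estimates),
    )


def riskiest_first(estimates):
    """Order tasks so the widest (least certain) ranges come first."""
    return sorted(estimates, key=lambda e: e.spread, reverse=True)


# Hypothetical tasks and numbers, purely for illustration.
estimates = [
    Estimate("design", 3, 5),
    Estimate("integration", 4, 12),
    Estimate("user guide", 1, 2),
]
```

With these made-up numbers the plan comes out at somewhere between 8 and 19 days, and “integration” tops the risk list because its range is the widest.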

When I have done any estimating myself I noted how frequently I answered “how long will this take?” with “a week” or “two weeks”. My feeling is that, any task that is too big to be done in a week is generally too big to even attempt to estimate or even give a title to, so we split it up into chunks that generally take, well, about a week. And any task that is so small that it would take less than a week is combined with other tasks that add up to, well, about a week!

If you have ever seen the movie The Money Pit or worked with house builders at all then you are familiar with the two week estimate. There is only ever two weeks of work: what we do this week and what we intend to do next week!

Project thoughts braindump

I went on a project management course recently. It was pretty 101 level but good enough for a n00b like me. It fired off a few thoughts that I want to dump here, not much insight here, but I want to write some more about it in another post.

What was interesting is that I feel that Agile is already accepted as being a great thing for business software, but this course basically ignored it on the grounds that
a) MOST people are still doing waterfall, in business or not
b) project planners sort of dislike it as it is hard to get the predictability of waterfall.

a) is, I think, undeniable. b) is wrong but interesting. The problem (as I note below) is that people like waterfall because it feels precise and ordered, but when it breaks you only find out late, and there is no way for it to degrade gracefully.

The problem with Agile is, of course, little emphasis on documentation of design or user guides etc etc. The emphasis is on producing working software of known, high quality in a fixed time with all the flexibility coming from precisely what functionality is delivered. To some people on the course (e.g., people writing firmware for Tomahawk cruise missiles – no, really!) flexible function and no documentation is not an acceptable compromise, no matter how fast and cheap you can write the software!

* you must succeed and be seen to succeed: reporting is not an afterthought

* project manager’s role is to be looking ahead to the next phase, not working on the current set of tasks.

* checklists and templates are a kind of “project process lite”, not a substitute for real thought but provide something to react to. Most of project management is based on using what you already did that was similar. Good checklists/templates to have:
  * project stakeholders
  * task estimation spreadsheet with 3-point estimate logic
  * project milestone/gate reports
  * risk checklist

* project owners are the people who have the right to judge whether the project was a success. project owners measure success in terms of things that were delivered 100%, not in terms of effort and tasks completed.

* a project goal statement is the definition of success; it is not a high-level requirements list, nor a brief definition of the project owner’s problem, nor a description of the proposed solution.

* project stakeholders are any people who are impacted by the project, not necessarily people on the project team doing the work.

* by creating and publishing a project goal you may flush out unhappy – and previously unknown – stakeholders. We can then correct the project goal at very low cost in effort and political standing. It won’t be easy, but it will be easier than changing course later.

* it is better to have one project goal that everyone knows about. For projects that have stakeholders that are outside the organization – or have such disparate opinions and viewpoints – then multiple goal statements may be needed. Consider your honesty and what you do if you are “found out”.

* the project management lifecycle is separate from – but interlaced with – the software/system development lifecycle. The initial gathering of high-level project objectives is correlated with early requirements capture for development, but project goals are NOT the same as the user requirements that will be needed to create even the high-level architecture of the system/software.

* the advantage of the waterfall model is that the project can proceed with many specialists who are coordinated by the project manager and work on different phases. only a few project “masters” are needed who understand all phases and work on the project for the whole cycle.
… the disadvantage of this is that different specialists will have little ownership and will not feel involved when their part is “done” and that can lead to different parts of the project team being at war (e.g., developers and

* project plan estimates are often nonsense: project managers pad time estimates for tasks in the plan and then senior managers cut these estimates arbitrarily because they know that they have been padded! How do we avoid these games and keep the plan a real tool for planning our projects? The answer: keep estimates ACCURATE but IMPRECISE by using a RANGE. The more imprecise the estimate is, the higher the risk. In some cases, highly imprecise or otherwise highly uncertain/risky tasks should be pushed towards the upper end of the estimate range. This means that only a few tasks are padded, and if senior management want to cut estimates for risky items then they can take responsibility for these specific cases!

* Building good, accurate plans is only possible where estimates are accurate. We must estimate then measure, re-estimate the remaining tasks and re-plan the project tasks to ensure that the project plan is still relevant. We must ensure that the estimate ranges are compared with the measured duration of the tasks so people can improve their estimation skills.

* Project planning and software design is inherently iterative; you must revisit earlier stages as you discover changes in the scope and deliverables of the project. The advantage of agile processes is that it recognises this explicitly; the disadvantage of waterfall is that it maintains the illusion of precision for too long; the iterative corrections are a footnote that we only discover late in the project.

* The balance between agile and waterfall is how many corrections we have to make; if the scope doesn’t change significantly then we can get away with the waterfall and get the advantages of simplicity, predictability and resourcing and staff with a single skill-set. If changes are constant or large then we need to recognise it and use agile, which is unpredictable (you don’t know what you will get, you only know that you will get it when the iteration ends) and requires highly-skilled, multi-talented developers who are highly motivated.

* Reducing length of the critical path by overlapping dependent activities is possible but entails increased risk. (could be financial risk of sunk costs in a cancelled project)

* Risk control activities: prevention, reduction (of impact or probability), acceptance (absorbing risk), transfer (of control or the risk itself), contingency (what we do when the risk comes to pass).

* Research to try and evaluate probability or impact of risk is a TASK that should be on the project plan.

* If we find a risk but don’t want to carry out the risk control work but don’t want to accept the risk then we can use risk monitoring TRIPWIRES to regularly monitor some metric that will indicate increased probability or impact of a given risk. The effort of monitoring the metric is some small repeated task that will enable the FULL risk control task effort to be saved unless it is needed.

* Project reporting for milestones should be deliverable based, not task based. No project owner cares what effort you have put in, only what you have delivered 100% complete. A 90% complete feature is a missing feature. For reporting inside the team, you can be activity based, as long as people know what these activities are. For informal, regular (say daily or weekly) meetings you can report what people are doing now.

* remember the project manager’s job is to be ahead of the team preparing the ground ahead so you should be able to report what you will be doing NEXT. You should also be clearing behind the team ensuring that milestones/gates are truly delivered.


If you can’t name it, you can’t write it

Some people in my company have recently decided that they need some more layers in their architecture. They are right, as it happens; but they are going about it all wrong. They are attempting to create a framework and we all know what happens when you put on your architecture hat and start building frameworks. Architecture astronauts only. Rather than solving specific problems and rolling them up in a framework they have started with the idea that they want a framework that wraps the database up and produces objects. And it should use LINQ. And it should do all sorts of cool stuff.

Unfortunately, since they don’t know what cool stuff it will do, they didn’t know what to call it so they held a little competition for people to suggest a name. And that is the problem, if you don’t even know what a thing should be called, how do you know what it is? If you don’t know what it is, you cannot sit down and write the code.

Of course, I pointed this out and made them very unhappy. A quick browse through the source control history revealed that they had started coding without any requirements. Of course, there were lots of empty interfaces with one object implementing the interface with only the method signature! And the really funny thing is that they have already changed the name. When they started they called it ClientAPI (actually the “client” refers to the customers of the business, not client as in “client-server”). Then later they changed it to ClientBusinessObjects. Great name.

And the really, really funny thing is that they aren’t business objects, they are actually database objects and factory classes from the database! It will be really fun when they start reading pattern books too!