I’ve been struggling for a metaphor to help explain the difference between statistical forecasting and estimation, and this one came to me, so lets give it a whirl.

Let me express the scenario in some human readable BDD (Gherkin) language

*Given I am an orange juice shop ownerand currently have no orangeswhen I buy 25 boxes of oranges of varying sizes and varietiesthen I need to forecast how many glasses of juice I can produce to sell*

Let me now describe two different approaches, the first like normal agile estimation processes, the second like a scientist.

## Method 1 - Estimation

I close my shop to customers gather my team together. I open the first box of oranges and hold up the first orange, showing them how plump it is and telling them it looks like it is probably a Belladonna variety orange (http://en.wikipedia.org/wiki/Orange_(fruit)#Common_oranges).

I ask them all to estimate how much juice the orange will yield. They each choose a planning poker card and I count them down, and they reveal their numbers. I ask them to discuss any outlying responses and then re-estimate until we get consensus. I note the number and move on to the next orange.

The second orange is a Berna orange and is quite small. We estimate that one. And so the long day goes on, with us repeating the process and selling no orange juice.

Eventually we finish estimating the first box, and we have a number which is our estimate of how much juice box 1 will yield. We have to make a decision, do we stop estimating and open the shop to sell some juice or do we go on and do the same process on box 2 through box 25.

Some juice shops stop at 1 box and use the estimated yield for that box as the "magic number" of yield per box. Other shops keep on going until all 25 boxes have been estimated one orange at a time, because they need a “more accurate estimate”. The downside of the accuracy is that we had to buy it by keeping the shop closed for 25 times as long.

What do I get at the end? Somewhere between 200 and 300 Story Points of oranges.

## Method 2 - Mathematics & Science

I open box 1, and count how many oranges are inside. I also open boxes 2, 3, 4, and 5 and count the oranges in each of them. 10 minutes later I open the shop for business and start my staff selling juice.

When a customer orders juice we measure how much juice the first 11 oranges actually yield. We don’t estimate, we measure.

**I now have enough data to make a pretty accurate forecast **of how much juice the boxes will yield. It’s only taken me 10 minutes.

Now for the sciencey bit. The German Tank Problem (http://en.wikipedia.org/wiki/German_tank_problem) is a famous bit of Bletchley Park Boffinery from the second world war. To save you a bit of reading, the Allies wanted to know how many of a particular tank they were likely to come up against in France when they invaded. There were 2 ways of getting the forecast, via Military Intelligence estimates, or using Statistic and probability. Lives depended on this so it had to be correct.

Here is a comparison of the 2 methods used over time, and on the right, the actual numbers found out at the end of the war.

Month | Statistical estimate | Intelligence estimate | German records |

June 1940 | 169 | 1,000 | 122 |

June 1941 | 244 | 1,550 | 271 |

August 1942 | 327 | 1,550 | 342 |

So the provenance for using this method is pretty good. Lets apply it to the oranges.

The key to it is understanding that the maths lets us make very accurate predictions using very small sample size. Indeed the formula below shows how likely the next item measured falls within our existing range of highest to lowest values, where k is the size of our sample

% Likelihood = (1 - (1 / k – 1)) * 100

So if we have 5 boxes of oranges with 17, 23, 16, 30, 25 oranges in each, the likelihood of the 6th box having more than 16 oranges and less than 30 is (1 - (1 / 5 -1)) * 100 which is 75%. 75% likelihood of all future boxes being inside the current know range from a sample of only 5 boxes.

A sample of 11 gives us 90% likelihood of the next being within our known range. Credit goes to Troy Magennis for explaining this to me over a couple of *Weissbiers* at the Kanban Leadership Retreat in 2013.

So thats a likelihood of 75% that we have between 16 and 30 oranges per box. That gives us a median of 23 oranges per box

So, how much juice will we get per orange? For the 11 oranges I measured I got 79.1, 78.5, 71.2, 72.1, 65.2, 79.3, 73.2, 67.2, 65.0, 75.3, and 69.1 ml.

So thats 90% likelihood that all oranges have between 65.0 and 79.3 ml each. That gives us a median of 72.2ml juice per orange

So I have 25 boxes, each with 23 oranges, each giving 72.2ml juice. I have a total yield of 25*23*72.2 ml of juice = 41.515 Litres of Juice. I’m going to have to buy a lot more boxes to keep my shop in stock for the day. I’m glad I found that out early enough to get back to the wholesalers in time.

In the knowledge work world, we tend to have to solve the same problem for forecasting work completion, and measure days per work item, and work items per “epic” or “MMF” or “MVP” or Project (whatever you call your orange boxes in your context). If you want real accuracy instead of working out medians and using those, you would plug the very same numbers into a Monte Carlo Simulation of your system of work, and work out how many of the project runs finish before each date. That is much more complicated to do, and requires a good old dose of processor power, but is far more accurate than my simple sums on the median values. However, even my simple sums are much more accurate than estimation, and cost much less time off from doing value work to generate.

If you’re running an IT Software project with 25 epics, you’d break down the first 5 epics into stories (the ones epics you’re going to work on first anyway) to work out the stories per epic number. When you start work you can see that each story takes between say 2 and 9 days based on a sample of 11 stories… We have all the data we need to make a good forecast, and not one piece of estimation has occurred. Most of the data is derived from doing the actual work we need to do to finish the project. Which is ideal, as it means we are focusing on doing the thing that will get us finished, and the forecast is a secondary outcome, not a distraction from doing the work like with estimation.

Tweet

# Numbers Game

Whenever I’m in a meeting talking about metrics, someone always brings up Velocity. If you’re familiar with Scrum you’ll know that in this context, velocity means the number of story points completed per sprint. If a team completes 4 stories each of which scores 5 story points in one sprint, the velocity of that team is said to be 20. If that doesn’t make sense, you should probably do some googling about now and come back when it does… don’t worry, I’m happy to wait for you. Look up planning poker while you’re on.

Ok ready to continue?

## Velocity isn’t a metric

No, Velocity can never be a real metric, and to use it as such is to play a dangerous game. But let me explain why this is so, and if I’m taking it away, what you can replace it with (or use alongside) as a real metric.

## Numbers are Magic

It is my belief - although as yet I do not have any scientific data to back myself up - that numbers hold a special place in our minds. Numbers and Mathematics are after all an abstract construct developed by humans as a language to explain science. It is my premise and belief that we see things that are expressed as numbers as things that we can control. If my mass is 125 Kilograms, I can target loosing 5 kilograms then go on a diet and exercise regime, and measure my progress with my measured mass on any day.

## When is a number not a number?

What madness is this? Well I’m trying to show the difference between a real mathematical number like say 3.142 and a string of digits like 055555 555 555. You probably recognise both of those, one is a shortened for of Pi, which we use in trigonometry to work out areas and circumferences of circles, and the other is a fake uk phone number. However doing maths makes sense with Pi, it doesn’t with a phone number. Imagine adding 44 to them.

x= 44+ 3.142, so x= 47.142

y= 44 + 05555555555 ….

While you can work out that y = 5555555599, it really makes no sense, what we really want to do is manipulate the string of digits to be +445555 555 555 - the international form of the phone number.

So some numbers are real mathematical numbers and others like phone numbers are just strings of digits, so we need to very careful whenever we use something that looks like a number but is really a string of digits as people will most likely not understand the difference without explanation and try to use the number as a real mathematical number and a control point.

Velocity is a tricky one as it looks like it should be a real number, after all we can “Measure" it for a team. But appearances can be deceptive. My old Physics teacher at secondary school once told me that “if the maths is ever confusing, look at the units and they will help you understand what is going on.” Wise words for maths, physics and velocity. Velocity is measured in “story points completed per sprint” which might as well be “tulips completeted per sprint” The problem with that is that it is a malleable number.

## Malleable Numbers?

What do I mean by a malleable number? Malleable means “easily influenced; pliable”. So let’s pretend I’m a Command and Control manager who is up against it for a project. I see Team A with a velocity of 22, and Team B with a velocity of 35. If I think these are real numbers and in my mind they are therefore control points, so I would be likely to ask the question, "how come Team B is so much ** better** than Team A, and why can’t Team A do 35 points per sprint too?”

We, as the kind of people who read blogs like this, probably know innately that that question is a danger sign, but lets follow it through.

Team A get pushed into working towards a target velocity of 35 points. So what are they likely to do in the next story grooming (refinement) meeting when they come to estimate their stories? I suggest that a story that would have been a 5 last time round will now be an 8 - or even a 13 point story. Why am I so confident? I’ve seen it happen in my own teams. But that is the least of the problems.

As a free gift with the instruction from manager to team there is a side order of “understanding that Management don’t seem to have a clue” which undermines the organisation, and pushes the team towards being more insular and deceitful.

It has a negative effect for everyone. Even if the team doesn’t “cheat” on estimation, they may well start “sand bagging” the sprint by working longer or extra hours. And of course that will artificially uplift velocity at first, but as people get tired, the velocity will actually drop (again I’ve seen that in action) as people stop working in a sustainable manner.

As soon as the thing we are trying to control is malleable, as human animals we can’t resist playing with it. A bit like a blob of blu-tac on your desk, it’s malleable and you can’t help but play with it. How do we solve that problem? We put the blu-tac away in a packet, drawer or cupboard. We need to keep the malleable stuff away from the people who want to play with it - so it is with velocity. Just like blu-tac, velocity is a useful tool when used correctly (like for deciding when to stop grooming stories at the story grooming meeting), but should be put out of reach the rest of the time. Use something which isn’t malleable as your metric - like how many days stories take to deliver. In the Kanban world we call that the Lead Time, and it is measured in days (or sometimes hours), a fixed unit of time. Unless we solve the problem of light speed travel, fixed units of time are Immutable Numbers.

Tweet

I recently blogged here on Is Estimation Waste and I would like to come back to it. There is a movement in Lean and Agile towards #NoEstimates which has a lot of value, but I still believe that we are risking missing something if we don't find an alternative method to get the value we get from Estima**tion **(shared tacit knowledge just before we start work on the Work Item) when we throw away our Estimates.

So with that warning all done, how do we go about throwing away estimates? To start simply, we need to ** size** our Work Items instead. This is something we all already impicitly do before we bring 'stories' or Work Items in to be estimated, but now we just need to be a bit more granular with it. Let me use a metaphor.

A dry stone wall is an ancient British technique for building walls using stones with no adhesive cement, hence 'dry' stone wall. This skill has allowed what is essentially a cleverly constructed wall which is a pile of stones to stand solid for hundreds of years in some of Britain's most exposed and weather beaten areas.

As you can see in the photo to the right, each stone in the wall is a different size and shape, every stone is unique. However while some stones are small and some are large, none of them are Boulders, and none of them are pebbles. Each stone is small enough to be lifted into place by a single person.

Interestingly there is a rule in dry stone walling, you don't go looking for a stone to fit a place, you pick up the next stone and do not put it down again, you find a place to use it.

The point I am making is that you cannot use Boulders in the wall, they are too big and unwieldily, so boulders have to be broken down into stones, and once a stone is stone sized, it gets used as soon as it comes to the top of the pile. It would be hard and also pointless to estimate how big each stone is, they are all irregular shapes with sloping sides and pointy bits.

This is the same thing as sizing your Work Items. Like a stone, every Work Item is unique, with differing size, complexity and emergence (all work items are likely to have some differing level of emergence in a Complex Adaptive environment). It takes much less effort to look at a proposed work item and decide if it is a 'stone', a 'boulder', or even a 'mountain'. If it's bigger than a 'stone' get out your hammer and cold chisel and start breaking it down into stones. Judging if a Work Item is 'stone' sized takes very little effort for a very small number of people, much less effort than estimation.

So by sizing them using maths with only 11 previously completed work items I can give 95% likelihood that any new work item will be smaller than our current largest. Contrast that with using Estimation instead. Once you have 11 work items finished each of them stone sized, you know the 12th work item is likely (this is the mathematical version of the work likely) to be smaller than your largest completed work item and larger than your smallest completed work item with a probability of 90% without even knowing anything about that new work item other than it has been sized to be a 'stone'. This is statistics working from us, and with more data points the accuracy of the prediction increases. It is also worth noting that of the 10% tolerance with 11 data points, 5% of that is the likelihood of the new work item being larger than your current largest work item, and 5% is the likelihood that it is smaller than the current smallest. Most of us aren't going to worry to much if it comes in smaller, so you have 95% likelihood that the new work item is smaller than your current largest.

Remember we work in a Complex Adaptive scenario with continually changing context, and each Work Item requirement is also Complex and Adaptive, so we have multiple levels of emergence in the work we do. So no matter how good we think we are at estimating (and if you don't measure back on completion you don't know how good you are, and that is even more Failure Demand to work it out) we are at best going to come out with roughly 60% accurate estimates. That number comes from empirical study.

So with simple (quick) sizing with low amounts of waste we get accuracy above 90% as soon as we have completed 11 work items (and most teams have already completed many more than 11). With Estimation which usually happens after sizing has already happened - we often call sizing grooming or story refinement in that scenario - we only get 60% accuracy. I'm not here to tell you that you're doing it wrong, but if you are spending ** more effort to get 30% less accuracy** in your predictions, you might want to have a think about

**you are doing the estimate in the first place.**

*why*Remember, if you are going to stop estimation, you need to find a way to replace the real value in estimation - the shared tacit knowledge. Even if that means you keep on estimating only to get that benefit, it seems like a useful thing to me. However, if you do keep on estimating, please do yourself a favour and throw away the estimate at the end, it's doing you no favours. Do the maths instead. It's faster and easier, and gives much more predictability that your stakeholders will love, and will make your job much less stressful.

Tweet