It is critically important to understand the reliability of software in production, and this goes beyond uptime. As products become more distributed and interconnected, it's no longer simple to calculate these statistics. If they are not understood, a company can be left open to litigation for not meeting SLAs and, what's possibly worse, a negative perception of reliability in the marketplace.

## Success!

Quick detour: I often like to add pictures to spice up the posts, and while searching for success, I stumbled upon this gem. All I could think of was that success in the 80s was a hell of a lot different than success today.

Anyways, defining Success Criteria is the number one most important thing that HAS to be done before anything else. *If you don’t define the success criteria, you’ll never achieve it.*

Let's use an example. If I ask for a hammer and bringing back any hammer counts as success, woohoo! The problem is there are tons of different hammers in all different shapes and sizes for different applications. If I need to hang a picture, a 16-24 oz claw hammer is probably the best choice, but if you brought me a scaling hammer, it wouldn’t be very useful. If I need to break up concrete, a sledgehammer is great, but a rubber hammer would just be pure comedy for onlookers. (For more info on hammers check out http://www.diydoctor.org.uk/projects/different-types-of-hammers.htm)

In software you must understand all the components that can fail, how they will fail, what stressors can cause failures to happen, and the magnitude of the resulting failure. In more complex systems it becomes more and more challenging to test all the aggregated possibilities for success and failure, which is where unit testing becomes critical. If I know the probability of each of my pieces failing, I am better equipped to understand the reliability of my system as a whole.

## Probability

There are two applicable types of probability: “a priori” and “a posteriori.” A priori is the calculated or estimated probability of what will happen, without testing the system. A posteriori is the probability based on testing the system.

#### A Priori

If there are **n** total results that are equally likely, and **x** results that are considered successful, then you can calculate the probability of success by dividing **x** by **n**.

#### A Posteriori

If, out of **n** attempts, f(n) is the number of successful results observed, then f(n) divided by n is the statistical probability, also called the empirical probability or the observed reliability.
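The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `a_priori` and `a_posteriori` and the example numbers are made up for this sketch, not part of any library.

```python
from fractions import Fraction


def a_priori(successful_outcomes: int, total_outcomes: int) -> Fraction:
    """x / n, assuming all n outcomes are equally likely."""
    if total_outcomes <= 0:
        raise ValueError("total_outcomes must be positive")
    return Fraction(successful_outcomes, total_outcomes)


def a_posteriori(observed_successes: int, attempts: int) -> float:
    """f(n) / n, the observed reliability after n attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return observed_successes / attempts


# Hypothetical numbers: 1 successful outcome out of 4 equally likely ones,
# and 27 observed successes in 100 attempts.
print(a_priori(1, 4))         # 1/4
print(a_posteriori(27, 100))  # 0.27
```

Using `Fraction` for the a priori value keeps the result exact, while the a posteriori value is a plain float because it comes from measurements anyway.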

### Probability Of a Dice Roll

Let's say we have a single die that we assume to be evenly balanced on all sides and rolled randomly. If we say rolling a 5 or 6 is success, then we can calculate the a priori probability of success as 2/6 or 33.33%. That is 2 different ways of being successful out of 6 equally likely possibilities.

When we roll the die 1 time we get a 2. Awesome, 100% of our rolls are 2s! Unfortunately, we haven’t tested nearly enough times.

When we roll the die 250 times we get the following results:

| Result | Count | Probability |
| --- | --- | --- |
| 1 | 42 | 17% |
| 2 | 40 | 16% |
| 3 | 48 | 19% |
| 4 | 42 | 17% |
| 5 | 37 | 15% |
| 6 | 40 | 16% |

Yikes, it looks like we should bet on 3 and stay away from 5s! Let's try rolling 1000 times:

| Result | Count | Probability |
| --- | --- | --- |
| 1 | 167 | 17% |
| 2 | 159 | 16% |
| 3 | 167 | 17% |
| 4 | 176 | 18% |
| 5 | 169 | 17% |
| 6 | 161 | 16% |

Now we are getting somewhere! A posteriori probabilities should start to approach the a priori probabilities as the number of tests grows. If your a posteriori results don’t start to approach your a priori probabilities, then your underlying assumptions are probably wrong and you need to evaluate them again.
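If you want to watch that convergence yourself, here is a rough simulation sketch. It assumes a fair die modeled with Python's `random` module. The helper names `roll_die` and `success_rate` and the fixed seed are my own choices for the example, and the counts it produces will not match the tables above.

```python
import random
from collections import Counter


def roll_die(rolls: int, seed: int = 42) -> Counter:
    """Simulate `rolls` throws of a fair six-sided die and count each face."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same counts
    return Counter(rng.randint(1, 6) for _ in range(rolls))


def success_rate(counts: Counter, winning_faces=(5, 6)) -> float:
    """A posteriori probability of success: f(n) / n."""
    total = sum(counts.values())
    return sum(counts[face] for face in winning_faces) / total


A_PRIORI = 2 / 6  # two winning faces out of six equally likely ones

for rolls in (1, 250, 1000, 100_000):
    observed = success_rate(roll_die(rolls))
    print(f"{rolls:>7} rolls: observed {observed:.2%}, a priori {A_PRIORI:.2%}")
```

The more rolls you simulate, the closer the observed rate should sit to 33.33%. If it didn't, that would be a hint that the "fair die" assumption is wrong.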

### Probability Rules

- The probability of success and the probability of failure are complementary. If the probability of success is R and the probability of failure is Q, then R + Q = 100% (or R + Q = 1).
- The likelihood that two independent events both succeed is the probability of the first success multiplied by the probability of the second success. So if R1 is 75% and R2 is 50%, then the probability both will be successful is 37.5% (0.75 * 0.50 = 0.375).
- To find the probability of a specific outcome of 2 events, list all the possible combinations and their probabilities. For example, if you want to know the probability that you roll a die 2 times and get a 1 or 2 both times, these are the distinct events that can happen:
    - the first roll is a 3 or more -> Stop
    - the first roll is a 1 or 2 and the second roll is a 3 or more -> Stop
    - both the first and second rolls are a 1 or 2 -> Success!
- The probabilities of those three events are:
    - Q1 = the probability of failure for the first roll (4/6 or 66.67%)
    - R1*Q2 = the probability of success for the first roll times the probability of the second roll failing (2/6 * 4/6 or 22.22%)
    - R1*R2 = the probability of rolling a 1 or 2 on both rolls (2/6 * 2/6 or 11.11%)
- What is most useful here is that the probabilities of all the possibilities should add up to 100% (66.67% + 22.22% + 11.11% = 100%).
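The rules above can be checked with exact fractions. This is just a sketch of the arithmetic: the variable names follow the R and Q notation from the list, and the `outcomes` dictionary is only there for printing.

```python
from fractions import Fraction

# Rule 1: success and failure are complementary (R + Q = 1).
r1 = r2 = Fraction(2, 6)  # success on a single roll: a 1 or a 2
q1 = 1 - r1
q2 = 1 - r2

# Rule 2: independent events multiply (the 75% and 50% example).
both_succeed = Fraction(3, 4) * Fraction(1, 2)
print(f"75% and 50% together: {float(both_succeed):.1%}")  # 37.5%

# Rule 3: list every distinct outcome of the two rolls.
outcomes = {
    "first roll fails (Q1)": q1,
    "first succeeds, second fails (R1*Q2)": r1 * q2,
    "both succeed (R1*R2)": r1 * r2,
}
for name, probability in outcomes.items():
    print(f"{name}: {probability} ({float(probability):.2%})")

# Sanity check: all the possibilities add up to exactly 1.
assert sum(outcomes.values()) == 1
```

Working in `Fraction` avoids the rounding that makes 66.66% + 22.22% + 11.11% look like 99.99% when you add the rounded percentages by hand.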

## Reliability

Reliability is the probability that a system will perform a required function under stated conditions for a specified period of time. Again, the most important part of the definition of reliability is that the Success Criteria are defined. Broken down:

- perform the required function: This is another way of saying the Success Criteria of the system are met
- stated conditions: The parameters under which the system will run
- a specified period of time: The minimum period of time for which it must keep running

One thing missing here is the number of usages. For example, if you were testing a shock absorber, its observed reliability is likely to be higher than its reliability rating on smooth asphalt, where it's not being engaged as frequently, and closer to its reliability rating on rougher terrain, where it is being used more frequently, even though both usages are within the stated conditions.

## Adjusted Reliability for Software

Software requires a slightly different definition than general reliability, especially as we start working with distributed systems. Reliability is the probability that a system will perform a required function **without critical failures** under stated conditions for a specified period of time. The change is the addition of **without critical failures**.

Why is this important? Let's take the example of the software powering a modern insulin pump. In our example the insulin pump connects to your phone for logging, monitors glucose automatically, and distributes insulin as needed based on the glucose readings.

This is a critical device that people need to survive, and it's important to understand the reliability of such a device. Here are some made up probabilities for the insulin pump components:

- Glucose Reading: 99.95% reliability
- Insulin Distribution: 99.995% reliability
- Connectivity to iPhone: 80%
- Logging: 95%
- Glucose Monitor Screen Readout: 90%

If we define all 5 functions as critical, we get an overall reliability of just 68.36% (the individual reliabilities multiplied together, treating the components as independent)! But if we say we absolutely need only the reading, logging, and distribution functions plus the screen readout, we improve to 85.45%. If we also remove the monitor screen readout, we further improve to 94.95% reliability. In the real world I wouldn’t bet my life on an insulin pump with 95% reliability, but it illustrates the point.
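Here is a small sketch of that arithmetic, using the made up component numbers from the list above. It assumes the components fail independently, so the reliabilities simply multiply. The dictionary keys and the `system_reliability` helper are names I invented for the example.

```python
from math import prod

# Made up component reliabilities from the list above (not real device data).
components = {
    "glucose_reading": 0.9995,
    "distribution": 0.99995,
    "phone_connectivity": 0.80,
    "logging": 0.95,
    "screen_readout": 0.90,
}


def system_reliability(critical: list[str]) -> float:
    """Multiply the reliabilities of the components we declare critical.

    Assumes the components fail independently of each other.
    """
    return prod(components[name] for name in critical)


everything = list(components)
no_phone = ["glucose_reading", "distribution", "logging", "screen_readout"]
no_phone_no_screen = ["glucose_reading", "distribution", "logging"]

print(f"{system_reliability(everything):.2%}")          # 68.36%
print(f"{system_reliability(no_phone):.2%}")            # 85.45%
print(f"{system_reliability(no_phone_no_screen):.2%}")  # 94.95%
```

The takeaway is that the list of critical functions is an input to the calculation. Change what counts as a critical failure and the reliability number changes with it.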
