MTBF, MTTF, and MTTR sure look like alphabet soup. Fortunately, though these abbreviations seem super technical, they’re actually shorthand for three easy-to-understand concepts:
- Mean time between failures
- Mean time to failure
- Mean time to repair
In the IT world, these terms often apply to the hardware, big and small, that’s associated with data center management. If you employ the IT systems management (ITSM) framework, these designations can help you plan your short- and long-term purchasing budgets as well as ensure you’ll have enough replacement products on hand when a non-repairable product does fail.
Let’s take a look at each.
Mean time between failures (MTBF)
This prediction uses previous observations and data to determine the average time between failures. MTBF predictions are often used to designate overall failure rates, for both repairable and replaceable/non-repairable products.
Here is the simplest equation for mean time between failures:
MTBF = total operational uptime between failures / number of failures
Let’s look at an example. Assume a manufacturer has recorded the following data points between product failures for one of its copier models:
| Failure number | Recorded operational uptime prior to failure (in hours) |
| --- | --- |
| 1 | 10,000 |
| 2 | 9,500 |
| 3 | 11,000 |
| 4 | 9,000 |
| Total | 39,500 |
Using the above equation, we get:
MTBF = (10,000 + 9,500 + 11,000 + 9,000) / 4 = 39,500 / 4 = 9,875
That means the manufacturer can expect, on average, approximately 9,875 hours of uptime on this copier between failures.
Of course, for MTBF calculations to be meaningful and more reliable, many more data points would be required. The more data points you have, the more accurate your MTBF predictions become, but the drawback is complexity. Many products that can fail have a variety of subsystems (consider all the hardware of a server, for instance, with disk drives, fans, motherboards, etc.), so a more precise MTBF prediction would need to account for all of those points.
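To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The figures are the hypothetical copier data from the table above, and the function name is purely illustrative:

```python
def mean_time_between_failures(uptime_hours):
    """Average operational uptime between recorded failures."""
    return sum(uptime_hours) / len(uptime_hours)

# Hypothetical copier data from the table above:
# hours of uptime recorded before each failure
copier_uptimes = [10_000, 9_500, 11_000, 9_000]

print(mean_time_between_failures(copier_uptimes))  # 9875.0
```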
Mean time to failure (MTTF)
Similar to MTBF, the mean time to failure (MTTF) is used to predict a product’s failure rate. The key difference is that MTTFs are used only for replaceable or non-repairable products, such as:
- Keyboards
- Mice
- Batteries
- Desk telephones
- Motherboards
Though the equation is similar to MTBF, MTTFs actually require only a single data point for each failed item.
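As a rough sketch of that single-data-point approach, you can take the operating hours logged for each failed unit in a population and average them. The keyboard figures below are invented for illustration:

```python
def mean_time_to_failure(hours_until_failure):
    """Average operating time before a non-repairable unit fails."""
    return sum(hours_until_failure) / len(hours_until_failure)

# Hypothetical data: operating hours logged for five keyboards
# before each one failed and was replaced
keyboard_lifetimes = [8_000, 12_500, 9_200, 10_800, 7_500]

print(mean_time_to_failure(keyboard_lifetimes))  # 9600.0
```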
MTTFs apply to two types of replaceable/non-repairable products:
- Replaceable products that cannot be repaired. When a product fails once, it must be replaced. The MTTF helps your IT department know when to expect products to turn over (fail), so they can maintain a proper supply for these instances.
- Examples include keyboards, mouse devices, and desk telephones, which are always replaced, never repaired.
- Importantly, some network appliances are replaceable (non-repairable), too, including certain firewalls, switches, modems, and other networking equipment that may be sealed units that run for years.
- Replaceable subsystems within a repairable product. The smaller subsystem may fail, but with its replacement, the larger product need only be repaired, not replaced entirely. In this scenario, the MTTF helps you budget: which smaller subsystems must you spend money to replace in order to keep down the overall cost of the larger system?
- Examples include hard drives, which generally fail as they age, and batteries within an uninterruptible power supply (UPS) system, which commonly last for 2-3 years even though the larger UPS lasts several years longer.
Mean time to repair (MTTR)
MTBF and MTTF measure time in relation to failure, but the mean time to repair (MTTR) measures something else entirely: how long it will take to get a failed product running again.
Because MTTR implies that the product is or will be repaired, MTTR really only applies to products with MTBF predictions. Conversely, items that have only an MTTF number will not have an MTTR prediction, because they will be replaced, not repaired.
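The usual arithmetic is the same shape as the other two metrics: total repair time divided by the number of repairs. Here is a minimal sketch with invented repair-ticket durations:

```python
def mean_time_to_repair(repair_hours):
    """Average time to restore a failed (repairable) product to service."""
    return sum(repair_hours) / len(repair_hours)

# Hypothetical repair-ticket durations, in hours, for one server model
server_repairs = [4.0, 2.5, 6.0, 3.5]

print(mean_time_to_repair(server_repairs))  # 4.0
```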
Designating MTBF, MTTF, and MTTR products within ITSM
IT systems management (ITSM) is a broad, universal framework for managing IT systems. Part of the framework is designating a variety of products under one (sometimes two) of these three terms.
While you can technically apply MTBF to both repairable and non-repairable items, it behooves an IT department to reserve the MTBF designation only for repairable items. Non-repairable/replaceable items should be designated under MTTF. This makes sense practically, but it’s also a way of separating long-lasting products that require big money from lower-end, short-term products.
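One way to keep that separation explicit is to record the designation alongside each asset in your inventory. The sketch below is a hypothetical illustration, not tied to any particular ITSM tool:

```python
# Hypothetical asset records: repairable items carry an MTBF (and MTTR);
# replaceable/non-repairable items carry only an MTTF.
assets = [
    {"name": "rack server", "designation": "MTBF", "mtbf_hours": 9_875, "mttr_hours": 4.0},
    {"name": "UPS battery", "designation": "MTTF", "mttf_hours": 20_000},
    {"name": "desk telephone", "designation": "MTTF", "mttf_hours": 35_000},
]

repairable = [a["name"] for a in assets if a["designation"] == "MTBF"]
replaceable = [a["name"] for a in assets if a["designation"] == "MTTF"]

print("Budget for repairs:", repairable)
print("Budget for replacements:", replaceable)
```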
Additional tools, like enterprise software that tracks these three designations, can help your data center become more reliable and better prepared, both financially and physically, for pending failures.
How reliable are these designations?
Data-minded folks say that these designations, when applied to IT, are only as good as the data they rely on. If you’re not measuring uptime, your guess on downtime will be just that: a guess. So the opposite must be true, too: great data means great output, and these estimates will give accurate, reliable indicators of failure and repair. Right?
Not so fast. Some in the IT and engineering fields point out that there’s no way to know more about failures unless you also consider more technical measurements, like the dispersion of time-to-failure data. These calculations are simple, initial forays into applying statistics to technology systems.
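For example, the same failure data can be summarized with a measure of spread as well as a mean. This sketch uses Python’s statistics module on the hypothetical copier uptimes from earlier:

```python
import statistics

copier_uptimes = [10_000, 9_500, 11_000, 9_000]

mean = statistics.mean(copier_uptimes)     # the MTBF figure
spread = statistics.stdev(copier_uptimes)  # sample standard deviation

# A wide spread relative to the mean suggests MTBF alone is a weak predictor
print(f"MTBF: {mean:.0f} hours, standard deviation: {spread:.0f} hours")
```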
At best, MTBF and MTTF are an average of previous activity. Depending on your position, you may consider these estimates useful or a crude approximation of what statistics can be used to do. Consider how these metrics can be useful in your environment – but also how you may need to go deeper for a better estimation of when failures will occur.