Preempting System Issues

Complete Developer Podcast - A podcast by BJ Burns and Will Gant - Joi

Simple systems fail simply. Complex systems also fail simply, but their interconnectedness with other systems makes mitigating failures much more complex. Past a certain level of complexity, system failures are an emergent property of the system: the set of system parts has a set of failure cases that the individual parts do not have by themselves. This makes it more difficult to predict what can go wrong with a system, and at some level, prediction is nearly impossible. However, you can predict many of the things that are likely to cause problems. By engaging in a few fairly simple thought exercises, you can greatly reduce the number of unexpected problems that your system encounters.

While it can be tempting to wait until a problem occurs to try to mitigate it, this is unwise in a production system that other people depend on. A system failure usually costs money at a minimum, and the problems can be far more severe than that. As a result, it’s common for software services to include a Service Level Agreement (SLA) that dictates expectations about the frequency of system outages, response times, and time expected to complete work. Even if your system is engineered so that it doesn’t completely fall over when a problem occurs, it can still violate an SLA and cost money.

The consumers of your application probably have their own clients, who have their own expectations. SLAs tend to bleed inward, from clients to the services that they use and then to the services that those services use. In contrast, systemic problems, including both errors and latency, tend to bleed outward, from one service to its clients and then to the clients of those clients. As a result, when you are thinking about how to find potential systemic problems, it’s often best to think of these problems from two different angles.
That is, you need to consider how errors and latency will bleed out as a result of a problem, while also considering how SLAs bleed in to put more stringent expectations on your system than you might expect. In effect, you are balancing your clients’ tolerance for errors against the difficulty of mitigating those errors. Depending on how critical your system is to your clients, these expectations will vary.

You can’t prevent every problem in a system, but you can usually prevent a large percentage of them by planning ahead. Until you’ve encountered enough unexpected problems, though, it can be difficult to envision how something can go wrong, or even to have a realistic thought process for thinking about what can go wrong. If you go through the thought exercises we’ve outlined here, you have a good chance of preventing most of the problems that will plague a complicated application. While this doesn’t fix everything, it can give you enough breathing room to fix the truly unusual problems that you’ll occasionally encounter.
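
To make the idea of SLAs bleeding inward a little more concrete, here is a rough sketch of how uptime promises compound along a chain of services. It is only an illustration: the function names and the 99.9% figures are hypothetical, and it assumes that every dependency must be up for a request to succeed and that the dependencies fail independently of each other, which real systems often do not.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a request path that needs every listed service to be up.

    Assumes the services fail independently (a simplifying assumption).
    """
    return prod(availabilities)


def required_per_service(target, n_services):
    """Availability each of n equally reliable serial dependencies must reach
    so that the whole path still meets the target."""
    return target ** (1.0 / n_services)


if __name__ == "__main__":
    # Hypothetical figures: three dependencies, each promising 99.9% uptime.
    path = composite_availability([0.999, 0.999, 0.999])
    print(f"Path availability: {path:.4%}")

    # What each dependency would need to deliver for the path to reach 99.9%.
    needed = required_per_service(0.999, 3)
    print(f"Needed per service: {needed:.4%}")
```

Under these assumptions, a path through three services that each meet 99.9% delivers slightly less than 99.9% overall, so a client that expects 99.9% from you is implicitly asking more than that of each service you depend on.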
