Archive for April, 2008

Running a program successfully many times does not decrease its chance of failure


Businesses often make the mistake of trusting software systems simply because they have run for a long time.  These systems are usually considered trustworthy because all known issues have been “shaken out”, meaning that all problems related to the business have been corrected.  Yet a system that has become a tried and true piece of the business may also become its Achilles’ heel.  The point that businesses miss is that the longer a system runs, the greater the likelihood that the system architecture will fail.  There are multiple reasons for this, but two of the main ones follow.

The first reason is that when software systems are built, the lifetime of the system tends to be overlooked in its design.  All software has a point at which it will fail, but that point is rarely determined or taken into account when the system is built.  When it is reached, the best case scenario is a catastrophic failure in which the system no longer functions.  The worst case is that the system continues to appear to function properly but is actually destroying your business data.
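
As an illustration only, here is a minimal Python sketch of one such lifetime limit.  It assumes a hypothetical system that stores record IDs in a 32-bit signed field; the names `next_id_silent` and `next_id_checked` are invented for this example.  The first function shows the worst case (the ID quietly wraps around and the system keeps running), the second shows the best case (the system stops loudly at its design limit).

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed field can hold


def next_id_silent(current: int) -> int:
    """Increment the way a fixed-width 32-bit signed field would: wrap silently."""
    # Simulate two's-complement wraparound with plain integer arithmetic.
    return (current + 1 + 2**31) % 2**32 - 2**31


def next_id_checked(current: int) -> int:
    """Increment, but fail loudly once the design limit is reached."""
    if current >= INT32_MAX:
        raise OverflowError("record ID space exhausted: design lifetime reached")
    return current + 1


# Worst case: the system appears to keep working, but the new ID is negative
# and may collide with or overwrite existing records.
print(next_id_silent(INT32_MAX))

# Best case: the system stops, and someone has to deal with it.
try:
    next_id_checked(INT32_MAX)
except OverflowError as err:
    print("halted:", err)
```

Neither version removes the limit.  The checked version only ensures that reaching it is visible rather than silent.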

The second reason for failure is that systems are altered over time, and each alteration increases the chance of failure.  Rarely does the person altering the system have the same understanding of its function as the initial development group (I may be falsely assuming that the initial development group was competent).  These changes are usually attempts to correct some mistake in the business process logic, but sometimes they are changes to the underlying architecture driven by some business need.  Taking one brick out of a building is not likely to cause a problem, but deleting even a single line of critical architecture code can yield dire results.

Are businesses doomed to have their critical software systems fail at the worst possible time?  Not if they take action to mitigate the risk associated with system failure.  Unfortunately, this requires an incredibly difficult step: businesses must change from being risk averse to being risk conscious.  As described earlier, businesses mistakenly tend to trust systems that have run for a long time without major failure, but businesses should never trust their core operational systems!  There should always be a fall-back system that can take over if the core system fails.

These fall-back systems should be able to run the business if the main system fails, but they can be significantly simplified to meet only the critical business process needs.  They also have the side benefit of serving as proofs of concept for a redesign of the critical system.  Instead of critical systems being tied to one specific implementation, the business will have other options that can be leveraged in the future.  The critical point to understand is that creating business process logic is expensive, time-intensive, and risky, while designing and architecting software systems is not.  Businesses should assign a small portion of their staff to constantly research how to improve their critical systems and to build secondary backup systems for all critical business systems.
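
The fall-back idea can be sketched in a few lines of Python.  This is only a sketch under stated assumptions, not a recommended design: `run_with_fallback`, `core_invoice_total`, and `backup_invoice_total` are hypothetical names, and the validity check is a stand-in for whatever sanity rules a real business would apply.  The wrapper switches to the simplified backup both when the core system crashes and when it returns an implausible answer, since the quiet failure is the more dangerous one.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    is_valid: Callable[[T], bool],
) -> Tuple[T, str]:
    """Return (result, source), where source is "primary" or "fallback"."""
    try:
        result = primary()
    except Exception:
        # Loud failure: the core system crashed outright.
        return fallback(), "fallback"
    if not is_valid(result):
        # Quiet failure: the core system answered, but the answer is implausible.
        return fallback(), "fallback"
    return result, "primary"


def core_invoice_total() -> float:
    """Hypothetical core system whose full pricing logic has started to misbehave."""
    return -125.0  # implausible: a negative invoice total


def backup_invoice_total() -> float:
    """Hypothetical simplified backup: only the critical rule, summing line items."""
    return sum([100.0, 25.0])


total, source = run_with_fallback(
    core_invoice_total, backup_invoice_total, lambda t: t >= 0
)
print(total, source)  # prints: 125.0 fallback
```

Returning the source alongside the result means every use of the backup is visible to the caller and can be reported instead of hidden.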


