What are the two primary concepts of system redundancy?

They are:

  • Process Redundancy, i.e. spare capacity built into a system
  • Equipment/Technology Redundancy

Process Redundancy is the amount of spare capacity built into the resources or physical assets in a system. The maximum downtime allowance for specific components within a system is often directly related to the process redundancy capacity, and maintenance staff are usually very clear about these requirements and deadlines. Examples include the storage capacity in a warehouse, the amount of water in a reservoir, the volume of product contained in a pipeline and the battery capacity in a UPS.

 

Equipment Redundancy generally includes all of the infrastructure and software systems used for monitoring, controlling and reporting on a process plant or system. It includes telecommunications and any transportation of resources required to facilitate normal operation, i.e. all non-process redundancy items.

 

The purpose of this discussion is to review a few key areas when designing equipment redundancy. Although not explored here, process redundancy should be considered at all phases of equipment redundancy design, as in certain circumstances it may remove the need for equipment redundancy altogether.

When is a RAM Study (Reliability, Availability and Maintainability) required?

There are a number of reasons an organisation may conduct a RAM study. Firstly, if the required system availability is known (perhaps from a specified standard or from experience), a new system design may be reviewed by conducting a RAM study. The results of a RAM study include design artefacts that can be referenced throughout the lifecycle of the asset or its automation system; if a component changes (or there is a fundamental change to the way the asset operates) the design should be reconsidered against the new inputs. Alternatively, when a design is based on accepted best practice, i.e. a defined architecture supplied by a vendor, a RAM study may be conducted to establish what the system availability actually is, so that operational performance can be measured against a reasonable expectation. The latter approach is not typical for critical infrastructure solutions. Usually critical plant operators know the tolerable downtime over any given period, and the design is developed against this benchmark.
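To illustrate the arithmetic behind such a study, the availability figure can be approximated from the failure and repair characteristics of each component. The sketch below is a minimal, hypothetical Python example; the MTBF/MTTR figures, the components chosen and the 8-hour downtime benchmark are invented purely for illustration.

# Minimal availability sketch (hypothetical figures, not vendor data).
# Availability of a repairable component ~= MTBF / (MTBF + MTTR).

HOURS_PER_YEAR = 8760

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*components: float) -> float:
    """Non-redundant chain: the system fails if any component fails."""
    result = 1.0
    for a in components:
        result *= a
    return result

# Assumed example: a SCADA server and a telecom link in series.
server = availability(mtbf_hours=20_000, mttr_hours=8)
link = availability(mtbf_hours=5_000, mttr_hours=4)

system = series(server, link)
downtime_hours = (1 - system) * HOURS_PER_YEAR
tolerable_downtime_hours = 8  # example benchmark set by the operator

print(f"Predicted availability: {system:.5f}")
print(f"Predicted downtime: {downtime_hours:.1f} h/yr "
      f"(benchmark: {tolerable_downtime_hours} h/yr)")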

 

The RAM study is a very powerful tool to help the designer evaluate the cost benefits of including or excluding elements of a system design, and to present these permutations for acceptance or exclusion.
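Building on the sketch above, the same arithmetic can be used to compare permutations, for example a single server versus a redundant pair. The figures below are again invented; the point is only to show how an include/exclude decision can be put in front of the designer as a downtime difference to weigh against cost.

# Comparing design permutations (hypothetical figures, illustrative only).

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

def one_out_of_two(a: float) -> float:
    """Redundant pair: unavailable only if both units are down at once
    (assumes independent failures and a perfect changeover)."""
    return 1 - (1 - a) ** 2

single = availability(mtbf_hours=20_000, mttr_hours=8)
pair = one_out_of_two(single)

for name, a in [("Single server", single), ("Redundant pair", pair)]:
    downtime = (1 - a) * 8760
    print(f"{name}: availability {a:.6f}, about {downtime:.2f} h downtime/yr")

# The designer can then weigh the reduced downtime against the cost of the
# second unit, its licences and the added failover complexity.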

 

Equipment Redundancy Components in Process Control & Automation Software Systems

Equipment used to monitor, control and provide information could include some or all of the following components:

  • Telecommunication Infrastructure (PSTN, Fibre Optic, Radio, Cable)
  • Information System Database and Web Servers
  • Server and Workstation Equipment, including operating systems
  • HMI and SCADA software systems, middleware and software drivers
  • Distributed Control Equipment (DCS, PLC, & RTUs)
  • Distributed Control Equipment embedded controller operating systems, configuration software and sequential logic code.

Such systems are generally very complex in nature. What appears simple in concept becomes extremely complicated in detailed form once redundant components are added.

Even the basics of redundancy can become complex very quickly

When doing detailed design, the level of redundant equipment should be considered for each component in the system. Each single point of failure should be reviewed, along with the consequences of its failure. In addition to reviewing the likelihood of the failure occurring, the procedure to remediate the condition should be defined, whether or not redundancy is going to be applied. The basis of every redundancy decision made during design should be documented thoroughly, so that the knowledge gained about the system's maintainability is recorded for future use.
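One lightweight way to record that basis of design is a simple single-point-of-failure register. The sketch below is only an illustration of the kind of record that could be kept; the fields and the example entry are assumptions, not a prescribed format.

# A minimal single-point-of-failure register (illustrative only).
from dataclasses import dataclass

@dataclass
class FailurePoint:
    component: str            # what can fail
    consequence: str          # effect on the process if it does
    likelihood: str           # qualitative or quantitative estimate
    redundancy_applied: bool  # was redundancy designed in?
    remediation: str          # procedure to restore service
    rationale: str            # basis of the design decision

register: list[FailurePoint] = [
    FailurePoint(
        component="SCADA historian server",
        consequence="Loss of trend data; operation continues on live values",
        likelihood="Low (hardware MTBF assumed > 20,000 h)",
        redundancy_applied=False,
        remediation="Restore from nightly backup onto cold-standby hardware",
        rationale="Process redundancy (local buffering in RTUs) covers the gap",
    ),
]

for fp in register:
    print(f"{fp.component}: redundant={fp.redundancy_applied} -> {fp.remediation}")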

 

Product suppliers should, at the very least, define under what conditions redundant components are actuated and returned to normal. It is worth investigating which parts of the system are compromised when operating in redundant modes and whether there are any performance trade-offs. For example, when a software system operates on its secondary server, do all client connections operate with the same performance as on the primary system? If not, does the degraded performance exacerbate the situation as operators overcompensate, further degrading system performance? What are all of the conditions under which the system is allowed to fail over to secondary systems automatically? Are operators and users made aware when this happens? Is the system permitted to return to normal (RTN) without intervention, or do maintenance personnel need to make an informed decision to restore the system after a formal or informal assessment has been made?
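These questions can be captured explicitly as a failover/failback policy rather than left implicit in vendor defaults. The following is a hypothetical sketch of such a policy; the condition names, thresholds and degraded modes are invented for illustration.

# Hypothetical failover/failback policy captured as data (illustrative only).
FAILOVER_POLICY = {
    # Conditions that trigger an automatic changeover to the secondary system.
    "auto_failover_conditions": [
        "primary server heartbeat lost for more than 10 s",
        "primary SCADA service not responding",
        "primary site network unreachable",
    ],
    # What is compromised while running on the secondary system.
    "degraded_on_secondary": [
        "reporting clients limited to read-only access",
        "historian writes buffered locally until failback",
    ],
    # How operators and users are made aware of the changeover.
    "notification": ["alarm raised in the control room",
                     "banner displayed on all HMI clients"],
    # Return to normal is a manual, assessed decision rather than automatic.
    "return_to_normal": {
        "automatic": False,
        "requires": ["fault cleared and verified",
                     "operator confirmation",
                     "reason code recorded"],
    },
}

print("Automatic return to normal:", FAILOVER_POLICY["return_to_normal"]["automatic"])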

What are some common traps when implementing redundant systems?

Generally, it is thought that a set of rules must be applied to fail a system over to the backup system: for example, if Condition A occurs, then use the backup system. The assumption is often made that if Condition A disappears, the system should return to normal. This assumption has several problems. User interaction, and the ability of users to interact with the operational plant, is a second dimension that must be considered.
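To make the problem concrete, the naive rule looks like the sketch below (a deliberately simplified, hypothetical fragment). If Condition A is intermittent, this logic hunts between primary and backup on every evaluation cycle.

# Naive rule-based failover (illustrative anti-pattern).
def select_active_system(condition_a_present: bool) -> str:
    if condition_a_present:
        return "BACKUP"   # fail over as soon as the condition appears
    return "PRIMARY"      # ...and fail straight back the moment it clears

# An intermittent fault makes the selection hunt on every scan:
for scan, fault in enumerate([False, True, False, True, True, False]):
    print(f"scan {scan}: active = {select_active_system(fault)}")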

 

When a process moves to an abnormal condition, it may not be desirable to return the system to normal simply by reversing the action that took it there in the first place. It may also be desirable to stage the return-to-normal process.

 

A further input to the management of redundant systems is the operators and controllers who intercede or are required to provide input. Often it is mandatory to have users control the process of returning a system to normal after visual checks have been made and certification processes completed. This may include recording what has been observed and the “reason code” for the failure. Even the initial failover to the standby system may require user confirmation. A strong advantage of operator intervention is that it prevents the redundancy switching logic from hunting between primary and standby systems. It also forces verification that the original fault has been cleared, or that the primary system now offers a better solution than the standby system does. This kind of judgement is often very difficult to imitate with logic controllers, especially if the secondary system is functional but itself running in a degraded condition.

Some systems implement a state machine to provide traceability of the conditions surrounding a change of operation. Depending on how the machine's state is visually displayed, it can greatly enhance maintenance and operational personnel's ability to make better judgements or to anticipate what will occur next in an expected sequence of events. This is particularly useful when restoring a system to normal operation.
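A minimal sketch of such a state machine is shown below, assuming a design in which failover can be automatic but return to normal requires operator confirmation. The state names, transitions and reason codes are illustrative assumptions, not taken from any particular product.

# Illustrative redundancy state machine with operator-confirmed failback.
from enum import Enum, auto

class State(Enum):
    PRIMARY_ACTIVE = auto()
    ON_STANDBY = auto()             # running on the secondary system
    AWAITING_CONFIRMATION = auto()  # fault cleared, waiting on the operator
    RETURNING_TO_NORMAL = auto()    # staged changeover back to the primary

class RedundancyManager:
    def __init__(self) -> None:
        self.state = State.PRIMARY_ACTIVE
        self.history: list[tuple[State, str]] = []  # traceability record

    def _move(self, new_state: State, reason: str) -> None:
        self.history.append((new_state, reason))
        self.state = new_state

    def fault_detected(self, reason: str) -> None:
        if self.state is State.PRIMARY_ACTIVE:
            self._move(State.ON_STANDBY, f"automatic failover: {reason}")

    def fault_cleared(self) -> None:
        # Do not return automatically; wait for an informed decision.
        if self.state is State.ON_STANDBY:
            self._move(State.AWAITING_CONFIRMATION, "primary fault cleared")

    def operator_confirms_return(self, reason_code: str) -> None:
        if self.state is State.AWAITING_CONFIRMATION:
            self._move(State.RETURNING_TO_NORMAL, f"operator approval: {reason_code}")
            self._move(State.PRIMARY_ACTIVE, "staged return complete")

mgr = RedundancyManager()
mgr.fault_detected("heartbeat lost")
mgr.fault_cleared()
mgr.operator_confirms_return("RC-101, visual checks complete")
for state, reason in mgr.history:
    print(state.name, "-", reason)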

What is SIL Rating?

Safety integrity level (SIL) is a relative level of risk reduction provided by a safety function, component or system. It is a risk target, or level of acceptance, that an operator may tolerate or must comply with in order to operate a plant under licence.
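For context, IEC 61508 associates each SIL with a target average probability of failure on demand for low-demand safety functions, which corresponds to an order-of-magnitude band of risk reduction. The sketch below encodes those commonly quoted low-demand bands; it is illustrative only and is no substitute for the standard itself.

# SIL bands for low-demand safety functions (per IEC 61508; illustrative).
# PFDavg = average probability of failure on demand.
SIL_BANDS = {
    1: (1e-2, 1e-1),  # risk reduction factor 10 to 100
    2: (1e-3, 1e-2),  # risk reduction factor 100 to 1,000
    3: (1e-4, 1e-3),  # risk reduction factor 1,000 to 10,000
    4: (1e-5, 1e-4),  # risk reduction factor 10,000 to 100,000
}

def sil_for_pfd(pfd_avg: float) -> int | None:
    """Return the SIL whose low-demand band contains this PFDavg."""
    for sil, (low, high) in SIL_BANDS.items():
        if low <= pfd_avg < high:
            return sil
    return None

print(sil_for_pfd(5e-3))  # a PFDavg of 0.005 falls in the SIL 2 band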

Why is SIL Rating important to Industrial Automation and Redundancy?

Not all Industrial Automation systems require a “safety in design” practice or a SIL-rated outcome, even though all systems should be “safe” to operate. Many systems which demand high levels of availability (very little downtime) need redundant systems or network components. This should not be confused with SIL rating, even though in many critical plant situations there is an overlap in the principles being applied. At the control system or process level this can lead to very complex system designs. In a safety system, equipment must meet particular requirements which may include dual CPUs, dual power supplies, dual communication paths, and configuration and fail-safe functions that ensure control sequences are executed reliably.

Complexity peaks at failure point

The return-to-normal process is by far the most complex undertaking in redundant systems. Often this is not understood and is therefore overlooked. The type or degree of redundancy should also be clearly defined by product suppliers and implementers, but this too is usually overlooked. Thorough testing of both (1) failure and (2) return from failure ensures plant operators can safely keep their systems operating through modes that rarely occur. Because failures occur so rarely, many operators systematically test their redundant systems on a regular basis. This allows operational processes to be practised, helps personnel become familiar with operating in the secondary state, and tests the operational integrity of the secondary equipment in a more controlled situation. Sometimes the redundancy offered by vendor equipment or the system design may not provide an acceptable solution. A few of many examples follow which highlight issues that are never black or white:

  • Using dual CPU modules on a PLC sharing a common backplane bus. This offers a degree of redundancy for processing, but under certain circumstances the failure of the standby or primary CPU may shut down the entire system.
  • Relational databases may be synchronised or replicated across the LAN connecting them. This may be prohibitively expensive and is therefore often not taken up, and errors on one system may be replicated to the secondary system.
  • Lock-step replication (data centre IT infrastructure replication) has a propensity to replicate software faults into the secondary system. This type of redundancy really only offers hardware redundancy and limited software-based redundancy; vendors must provide integrated application redundancy to overcome software failure.
  • Radio infrastructure and propagation conditions are often very difficult to duplicate. The optimum physical position for an antenna is normally taken by the primary installation, and a secondary position may not even be available for reliable communications. Therefore, alternative communications media may be implemented for selected system components, or additional cold-swap spares allocated to maintenance staff to facilitate rapid restoration of service. Secondary systems must be managed as “different” from the primary system because they are rarely identical in terms of operational performance.

How important is system design when a product is redundant?

Sometimes designs must be improved to reduce the risk of failure rather than catering for what is considered unlikely. High-quality components and more thorough testing may increase confidence levels to a point that overrides the requirement to implement redundancy. Whatever the final design includes, it should at least contain the following:

  • A description of the philosophy of redundancy for process and equipment.
  • A definition of the user interface and the user’s authority to change the process.
  • Monitoring procedures for operation of redundancy and all component failures.
  • Detailed workflow diagrams or descriptions of how standby systems are to be selected and the sequence for returning to the primary system, covering both automatic and manual operation.
  • A periodic testing regime to reduce operational risk (a simple scheduling sketch follows below).
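On that last point, a periodic testing regime can be as simple as a scheduled failover drill with a record of each result. The sketch below is a hypothetical illustration of how such drills might be tracked; the quarterly interval, system names and fields are assumptions, not a recommendation.

# Hypothetical tracker for periodic failover drills (illustrative only).
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class FailoverDrill:
    performed: date
    system: str
    failover_ok: bool          # did the changeover to standby succeed?
    return_to_normal_ok: bool  # did the staged return succeed?
    notes: str = ""

DRILL_INTERVAL = timedelta(days=90)  # assumed quarterly drills

def next_drill_due(history: list[FailoverDrill]) -> date:
    """Next due date, based on the most recent drill performed."""
    latest = max(drill.performed for drill in history)
    return latest + DRILL_INTERVAL

history = [
    FailoverDrill(date(2024, 1, 15), "SCADA servers", True, True),
    FailoverDrill(date(2024, 4, 16), "SCADA servers", True, False,
                  notes="Historian resynchronisation exceeded the window"),
]

print("Next drill due:", next_drill_due(history))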