There are two primary concepts of redundancy when considering process and automation systems.
- What are the two primary concepts of system redundancy?
- When is a RAM STUDY (Reliability, Availability and Maintainability) required?
- Even the Basics of Redundancy can become complex very easily
- What are some common traps when implementing redundant systems?
- What is SIL Rating?
- Why is SIL Rating important to Industrial Automation and Redundancy?
- Complexity peaks at failure point
- How important is system design when a product is redundant?
What are the two primary concepts of system redundancy?
They are:
Process Redundancy is the amount of spare capacity built into the resources or physical assets in a system. The maximum downtime allowance for specific components within a system is often directly related to the process redundancy capacity. Maintenance staff are usually very clear about these requirements and deadlines. Examples include the storage capacity in a warehouse, the amount of water in a reservoir, the volume of product contained in a pipeline, and the battery capacity in a UPS.
Equipment Redundancy generally includes all of the infrastructure and software systems used for monitoring, controlling and reporting of a process plant or systems. It includes telecommunications and any transportation of resources required to facilitate normal operation, i.e. all non-process redundancy items.
The purpose of this discussion is to review a few key areas when designing equipment redundancy. Although not explored here, process redundancy should be considered at all phases of equipment redundancy as it may negate its need in certain circumstances.
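The link between process redundancy and maximum downtime allowance can be illustrated with a small calculation. This is a minimal sketch only: the function names, the constant-demand assumption and the 1.5 safety margin are illustrative choices, not figures from any standard.

```python
def autonomy_hours(stored_volume_m3: float, demand_m3_per_h: float) -> float:
    """Hours the process can keep running on its buffer alone.

    Assumes a constant demand rate, which is a simplification.
    """
    if demand_m3_per_h <= 0:
        raise ValueError("demand must be positive")
    return stored_volume_m3 / demand_m3_per_h


def equipment_redundancy_indicated(autonomy_h: float,
                                   repair_time_h: float,
                                   margin: float = 1.5) -> bool:
    """True when the buffer cannot cover the repair time with some margin.

    The margin of 1.5 is an arbitrary illustrative value.
    """
    return autonomy_h < repair_time_h * margin


# Hypothetical reservoir: 1200 m3 stored, 100 m3/h drawn by consumers.
hours = autonomy_hours(1200, 100)
print(hours)
print(equipment_redundancy_indicated(hours, repair_time_h=4))
print(equipment_redundancy_indicated(hours, repair_time_h=10))
```

If the buffer comfortably outlasts the expected repair time, the process redundancy may already cover the outage and duplicated equipment may add cost without adding benefit.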
When is a RAM STUDY (Reliability, Availability and Maintainability) required?
There are a number of reasons an organisation may conduct a RAM study. Firstly, if the required system availability is known (perhaps from a specified standard or from experience), a new system design may be reviewed against it by conducting a RAM study. The results of a RAM study include design artefacts that can be referenced throughout the lifecycle of the asset or its automation system. If a component changes (or there is a fundamental change to the way the asset operates), the design should be reconsidered against the new inputs. Alternatively, when a design is based on accepted best practice, i.e. a defined architecture supplied by a vendor, a RAM study may be used to establish what the system availability is, so that operational performance can be measured against a reasonable expectation. The latter approach is not typical for critical infrastructure solutions. Critical plant operators usually know the tolerable downtime over any given period, and the design is therefore developed against this benchmark.
The RAM study is a very powerful tool to help designers evaluate the cost benefits of including or excluding elements of a system design, and then present these permutations for acceptance or exclusion.
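As a rough illustration of the arithmetic behind such a study, the sketch below uses the standard steady-state formulas: availability of one component is MTBF / (MTBF + MTTR), components in series multiply their availabilities, and redundant (parallel) components multiply their unavailabilities. It assumes independent failures and ignores common-cause faults and switchover time, which a real RAM study would have to address. All figures are hypothetical.

```python
from math import prod

HOURS_PER_YEAR = 8760


def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability of a single repairable component."""
    return mtbf_h / (mtbf_h + mttr_h)


def series(avails: list[float]) -> float:
    """All components must work: availabilities multiply."""
    return prod(avails)


def parallel(avails: list[float]) -> float:
    """Any one component is enough: unavailabilities multiply."""
    return 1 - prod(1 - a for a in avails)


def downtime_hours_per_year(a: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1 - a) * HOURS_PER_YEAR


# Hypothetical design options for a server plus a network switch.
server = availability(mtbf_h=9990, mttr_h=10)   # 0.999
switch = availability(mtbf_h=49950, mttr_h=50)  # 0.999

single = series([server, switch])
redundant_servers = series([parallel([server, server]), switch])

print(round(downtime_hours_per_year(single), 2))
print(round(downtime_hours_per_year(redundant_servers), 2))
```

Comparing the two results shows the kind of permutation a designer might present: duplicating the server barely helps if the unduplicated switch still dominates the downtime.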
Equipment Redundancy Components on Process Control & Automation Software Systems
Equipment used to monitor, control and provide information could include some or all of the following components:
Systems are generally very complex in nature. What appears to be simple in concept becomes extremely complicated in detailed form when redundant components are added.
Even the Basics of Redundancy can become complex very easily
During detailed design, the level of redundancy should be considered for each component in the system. Each single point of failure should be reviewed, together with the consequences of its failure. In addition to reviewing the likelihood of the failure occurring, the procedure to remediate the condition should be defined, whether or not redundancy is going to be applied. The basis of every redundancy decision made during design should be documented thoroughly so that the knowledge gained about system maintainability is recorded for future use.
Product suppliers should, at the very least, define the conditions under which redundant components are activated and returned to normal. It is worth investigating which parts of the system are compromised when operating in redundant modes and whether there are any performance trade-offs. For example, when a software system operates on its secondary server, do all client connections operate with the same performance as on the primary system? If not, does the degraded performance exacerbate the situation for users and therefore further degrade system performance as operators overcompensate? What are all of the conditions which allow a system to automatically fail over to secondary systems? Are operators and users made clearly aware when this happens? Is the system permitted to return to normal (RTN) without intervention, or do maintenance personnel need to make an informed decision to restore the system after a formal or informal assessment has been made?
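One simple way to start a single-point-of-failure review is to list the alternative paths through the system and find the components that sit on every one of them. The sketch below does only that; the component names are hypothetical, and a real review would also cover shared power, shared software and other common-cause dependencies that a path list does not capture.

```python
from dataclasses import dataclass


def single_points_of_failure(paths: list[list[str]]) -> set[str]:
    """Return the components that appear on every path.

    If such a component fails, no alternative path remains.
    """
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    return common


@dataclass
class DesignDecision:
    """Illustrative record of why redundancy was or was not applied."""
    component: str
    consequence_of_failure: str
    remediation_procedure: str
    redundancy_applied: bool


# Hypothetical architecture: dual controllers and servers, one sensor, one switch.
paths = [
    ["sensor", "plc_a", "switch", "scada_primary"],
    ["sensor", "plc_b", "switch", "scada_standby"],
]
spofs = single_points_of_failure(paths)
print(sorted(spofs))  # ['sensor', 'switch']

decisions = [
    DesignDecision(name, "to be assessed", "to be defined", False)
    for name in sorted(spofs)
]
```

Keeping a record per component, including those left without redundancy, is one way to capture the basis of each decision for later reference.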
What are some common traps when implementing redundant systems?
Generally, it is thought that a set of rules must be applied to fail a system over to the backup system. For example, if Condition A occurs, then use the Backup System. It is then often assumed that if Condition A disappears, the system should return to normal. This assumption has several problems. User interaction and the ability to interact with operational plant is the second dimension to be considered.
When a process moves to an abnormal condition, it may not be desirable to return the system to normal by reversing the action that got it there in the first place. It may also be desirable to stage the return-to-normal process.
A further input to the management of redundant systems is the operators and controllers who intercede or are required to provide input. Often it is mandatory to have users control the process of returning a system to normal after visual checks are made and certification processes completed. This may include making records of what has been observed and the “reason code” for failure. Even initially selecting failover to the standby system may require user confirmation. A strong advantage of operator intervention is that it prevents the redundancy switching logic from hunting between primary and standby systems. Additionally, it forces verification that the original fault has been cleared or that the primary system offers a better solution than the standby system does. This type of intuition is often very difficult to imitate with logic controllers, especially if the secondary system is functional but also in a degraded performance condition.
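The hunting problem can be shown with a toy comparison. The sketch below counts switchovers for an intermittent fault under two policies: a naive rule that follows Condition A in both directions, and a latched rule that fails over automatically but returns to primary only when an operator confirms. The function names and the sample data are illustrative assumptions, not any vendor's switching logic.

```python
def switchovers_auto_return(fault_samples: list[bool]) -> int:
    """Naive rule: standby while the fault is present, primary when it clears."""
    on_standby = False
    switches = 0
    for fault in fault_samples:
        if fault != on_standby:
            on_standby = fault
            switches += 1
    return switches


def switchovers_operator_return(fault_samples: list[bool],
                                confirmed_at: set[int]) -> int:
    """Latched rule: automatic failover, return only on operator confirmation.

    confirmed_at holds the sample indices at which an operator approves
    the return; approval is ignored while the fault is still present.
    """
    on_standby = False
    switches = 0
    for i, fault in enumerate(fault_samples):
        if fault and not on_standby:
            on_standby = True
            switches += 1
        elif on_standby and not fault and i in confirmed_at:
            on_standby = False
            switches += 1
    return switches


# Hypothetical intermittent fault that appears and clears three times.
flapping = [False, True, False, True, False, True, False, False]

print(switchovers_auto_return(flapping))           # 6
print(switchovers_operator_return(flapping, {7}))  # 2
```

With the latched rule the system stays on standby through the intermittent period and makes a single, deliberate return once the operator is satisfied the fault has cleared.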
Some systems implement a state logic “machine” to provide traceability of the conditions surrounding a change of operation. Depending on how such a machine is visually displayed, it can greatly enhance maintenance or operational personnel’s ability to make better judgements or anticipate what will occur next in an expected sequence of events. This is particularly useful when restoring a system back to normal operation.
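A state machine of this kind could be sketched as follows. The three states, the event names and the reason code shown are assumptions chosen for illustration; a real system would likely have more states (for example a staged return) and its own event vocabulary.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    ON_STANDBY = "on_standby"
    READY_TO_RETURN = "ready_to_return"


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.NORMAL, "fault_detected"): State.ON_STANDBY,
    (State.ON_STANDBY, "fault_cleared"): State.READY_TO_RETURN,
    (State.READY_TO_RETURN, "fault_detected"): State.ON_STANDBY,
    (State.READY_TO_RETURN, "operator_approved"): State.NORMAL,
}


@dataclass
class RedundancyStateMachine:
    state: State = State.NORMAL
    history: list = field(default_factory=list)

    def fire(self, event: str, reason_code: str = "") -> bool:
        """Apply an event; return False and change nothing if not allowed."""
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return False
        # Record every change of operation for later traceability.
        self.history.append((self.state, event, next_state, reason_code))
        self.state = next_state
        return True

    def next_events(self) -> list[str]:
        """Events that are valid from the current state."""
        return sorted(e for (s, e) in TRANSITIONS if s is self.state)


sm = RedundancyStateMachine()
sm.fire("fault_detected", reason_code="PSU-01")  # hypothetical reason code
print(sm.fire("operator_approved"))  # False: fault has not cleared yet
sm.fire("fault_cleared")
print(sm.next_events())  # ['fault_detected', 'operator_approved']
```

Displaying the current state together with the output of `next_events()` is one way to let personnel anticipate what can happen next, and the `history` list provides the record of conditions surrounding each change.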
What is SIL Rating?
Safety integrity level (SIL) is a relative level of risk-reduction provided by a safety function, component or a system. It is a risk target or level of acceptance that an operator may tolerate or must comply with to operate plant by license.
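To make the idea of a "relative level of risk reduction" concrete, the sketch below maps an average probability of failure on demand (PFDavg) to a SIL using the bands commonly quoted from IEC 61508 for low-demand mode, and derives the corresponding risk reduction factor as 1 / PFDavg. The bands should be checked against the current edition of the standard before use, and the function names are illustrative.

```python
# (SIL, lower bound inclusive, upper bound exclusive) for PFDavg,
# low-demand mode, as commonly quoted from IEC 61508.
SIL_BANDS = [
    (4, 1e-5, 1e-4),
    (3, 1e-4, 1e-3),
    (2, 1e-3, 1e-2),
    (1, 1e-2, 1e-1),
]


def sil_for_pfd(pfd_avg: float):
    """Return the SIL whose band contains pfd_avg, or None if outside all bands."""
    for sil, low, high in SIL_BANDS:
        if low <= pfd_avg < high:
            return sil
    return None


def risk_reduction_factor(pfd_avg: float) -> float:
    """Risk reduction factor, defined as the reciprocal of PFDavg."""
    return 1.0 / pfd_avg


print(sil_for_pfd(5e-3))  # 2
print(sil_for_pfd(0.5))   # None: too unreliable for any SIL band
print(round(risk_reduction_factor(5e-3)))  # 200
```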
Why is SIL Rating important to Industrial Automation and Redundancy?
Not all industrial automation systems require “safety in design” practice or a SIL-rated outcome, even though all systems should be “safe” to operate. Many systems which demand high levels of availability (very little downtime) need redundant systems or network components. This should not be confused with SIL rating, even though in many critical plant situations there is an overlap in the principles being applied. At the control system or process level this could lead to very complex system designs. In a safety system, equipment must meet particular requirements, which may include dual CPUs, dual power supplies, dual communications paths, and configuration and fail-safe functions which ensure that control sequences are executed reliably.
Complexity peaks at failure point
The return-to-normal process is by far the most complex undertaking in redundant systems. Often this is not understood and is therefore overlooked. The type or degree of redundancy should also be clearly defined by product suppliers and implementers, but is usually overlooked. Thorough testing of both (1) failure and (2) return from failure ensures plant operators can safely keep their systems operating through modes that rarely occur. Because failures occur very rarely, many operators systematically test their redundant systems on a regular basis. This allows operational processes to be practised, helps personnel become familiar with operating in the secondary state, and also tests the operational integrity of secondary equipment in a more controlled situation. Sometimes the redundancy offered by vendor equipment or the system design may not provide an acceptable solution. A few of many examples follow which highlight issues that are never black or white:
How important is system design when a product is redundant?
Sometimes designs must be improved to reduce the risk of failure rather than catering for what is considered unlikely. High quality components and more thorough testing may increase the confidence levels to a point that overrides the requirements to implement redundancy. Whatever the final design includes, it should at least contain the following: