There is nothing simple about managing industrial automation software systems.
- Why do we need to update our SCADA or Industrial Automation software?
- How often do we need to upgrade our Industrial Automation software?
- Why are staging environments important?
- How many staging environments do I need?
- What is configuration management and why is it important?
- High level recommendations to avoid disaster
Applications aside, many factors are interwoven, and a narrow-focus approach is a system killer. The application guy may say, “let’s load the latest” (because that gives the latest features), but what are we agreeing to (or recommending) when that is the single topic of discussion? In modern times, the approach to managing software systems has been elevated to critical-systems thinking for many organisations, even when the organisation is not operating “critical infrastructure”. This thinking has been further extended so that software is now part of the safety consideration that must be assessed.
Remotely operating plant, equipment and other assets has become the latest expected practice. However, the human element, arguably the most important part of remote management, should have seen a matching step change in security, staging practice and systems engineering practice. Our guess is that many software systems have jumped forward in functional capability and most likely in security. However, we have significant doubts that systems management practices and business processes are keeping pace with the increasing demand to ensure software systems are safe and secure to operate.
Why do we need to update our SCADA or Industrial Automation software?
Whether it is industrial automation software, a desktop application, your garage door controller app or server software, it all requires maintenance. There is a myth that software systems are black boxes doing something different every time. The truth is that software is deterministic: the system does what it is supposed to do, in an expected way. What cannot be seen by the naked eye is that the underlying environmental factors and data are constantly changing. This fundamental principle must be internalised to fully appreciate that software systems are repeatable, but the environments they operate in are never exactly the same.
“Updating a system” is a general umbrella term that could include an upgrade of software, applying a major patch, a minor hotfix or perhaps even a complete reinstallation. By any measure, the actual step of doing an “update” could be very simple or very, very ugly.
Operating system providers (e.g. Microsoft) constantly release updates for end users and system administrators to keep their computer systems up to date. The primary reasons for constant updates are bug fixes and closing security vulnerabilities. This happens so often that components like Windows Server Update Services (WSUS) exist to make the job so easy you don’t even have to worry about it, or do you?
As software users, we all face the same predicament: load the latest software to remain “safe”, or wait a little while and see if the world falls apart. From a software vendor’s perspective, it makes economic sense to have all software users on the same version of software. This reduces the costs associated with supporting legacy software and the complexity of keeping experienced developers on the team to support every version of the software in the open market. This is one of the reasons software reaches “end of life” even though it still works, and it is an underlying motive for getting users to update software as soon as possible.
How often do we need to upgrade our Industrial Automation software?
Many industrial automation software vendors release new versions of software annually, with intermediate releases occurring from time to time. For security vulnerabilities or critical bug fixes, patches are released out of cycle, and the impact of each release is carefully monitored. With industrial automation software, the user group is relatively small compared to desktop applications like Microsoft Word.
Managers of industrial automation systems typically schedule their upgrades based on factors such as the risk of changing something with known operational performance, the benefits of the upgrade, whether the current patches are keeping pace with the latest security vulnerability remediation, and whether a hardware refresh is pending, just to name a few. The important point here is that IT infrastructure managers have their own plan and should not simply mirror the wishes of the vendors. This is an important principle because, as we said earlier, though the software is exactly the same no matter where it is installed, the environment never is; how it operates may be the same most of the time, but never with 100% certainty. How software is staged into production is the key mitigation step to move the certainty measure as close to 100% as possible.
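Factors like these can be captured in a simple pre-upgrade checklist. The sketch below is purely illustrative: the factor names and weights are our own assumptions, not a standard, and a real assessment would feed into your formal risk register.

```python
# Illustrative pre-upgrade checklist (factor names and weights are assumptions).
def upgrade_risk_score(factors: dict) -> int:
    """Return a crude score: higher means more reason to defer the upgrade."""
    weights = {
        "changes_known_good_system": 3,   # risk of disturbing proven operational performance
        "security_patches_lagging": -2,   # lagging patches argue FOR upgrading
        "hardware_refresh_pending": -1,   # a refresh is a natural upgrade window
        "clear_functional_benefit": -2,   # tangible benefit argues for upgrading
    }
    return sum(weights[name] for name, present in factors.items() if present)

score = upgrade_risk_score({
    "changes_known_good_system": True,
    "security_patches_lagging": True,
    "hardware_refresh_pending": False,
    "clear_functional_benefit": False,
})
print(score)  # 3 - 2 = 1
```

The point is not the arithmetic, but that the decision is made against your own plan and weights, not the vendor’s release calendar.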
Even if you have just purchased software for first-time use and installed it onto a new hardware platform (or virtual environment), there is still no certainty that the latest clean install doesn’t also need to be patched.
Why are staging environments important?
A staging environment, i.e. an environment that closely matches the destination production system, provides a “sandbox” in which to implement new versions, patches or hotfixes and determine whether they play nicely.
Often critical infrastructure managers have several staging environments, namely test, development, pre-production, and production. This may seem excessive at first glance, however, when you are dealing with critical situations this is the bare minimum required. Let us explain why.
When a system has regular configuration changes occurring, a development environment is a safe place to play without interrupting any formal staging process. This is an environment where “anything goes”. The development environment (or multiple instances of it) is usually instantiated on lower grade computer hardware, which cannot compare to the production system. This is not altogether a bad thing, especially for large software systems where performance testing is an essential mitigation step. The environment is expendable and can be reinstated “clean” when new stuff is to be applied.
In an ideal world, the next level of staging is a test environment. This environment should be maintained closely to the production environment. Though end user testing (UAT) will occur in this “sandbox”, custom code should be quarantined elsewhere. UAT on testing environments helps system administrators take the obvious software deficiencies off the table. This helps to significantly reduce exposure to the risk attached to rolling out the latest and greatest, when it may not be what is expected. The testing environment may also incorporate testing devices (eg IoT) and business interfaces such that “real world” test data can be used rather than simulated data only.
Pre-production environments usually match the production environment. If the production environment has redundant functionality (a significant complexity that must be considered in its own right), then pre-production should ideally match this functionality, along with any other significant architectural features. The systems should be as close to identical as possible. Being able to stage redundant functionality before it is deployed to the production environment moves the envelope closer to the 100% certainty we mentioned earlier. The complexity of redundant systems cannot be overstated and even if the software servers can be adequately tested, the network interfaces to each environment will be unique to their own system.
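Keeping pre-production “as close to identical as possible” is easier to enforce when the comparison is automated. Here is a minimal sketch of a drift check between two environment manifests; the manifest shape (component name to version) and the component names are assumptions for illustration.

```python
# Illustrative configuration-drift check between two environments.
# Manifest format (component -> version string) is an assumption for this sketch.
def find_drift(pre_prod: dict, prod: dict) -> list[str]:
    """List components whose versions differ, or that exist in only one environment."""
    issues = []
    for component in sorted(set(pre_prod) | set(prod)):
        a, b = pre_prod.get(component), prod.get(component)
        if a != b:
            issues.append(f"{component}: pre-prod={a} prod={b}")
    return issues

drift = find_drift(
    {"scada-server": "9.2.1", "historian": "4.0.0", "redundancy-mgr": "2.1"},
    {"scada-server": "9.2.1", "historian": "3.9.5"},
)
print(drift)  # flags the historian mismatch and the missing redundancy manager
```

An empty result does not prove the environments are identical (network interfaces, as noted above, remain unique to each system), but a non-empty result is a cheap early warning.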
Each of these stages focuses on improving certainty: validating that functions that previously worked continue to work, confirming that new functions work as expected, reducing operational risk, improving administrators’ familiarity with the new software, and overall getting ready to begin another cycle of finding and resolving issues quickly. In all of this, staging attempts to prove that the software systems and hosting software operate with improved confidence, not the network or business system interface performance.
There are significant risks to consider when updating application software that has special functional use or is large scale. In addition, other areas of risk with software updates are related to retrospective testing of interfaces, APIs or special functions that the average user may not make use of or that the software vendor does not have a specific testing harness to validate. We all expect the QA process to capture everything, but it doesn’t. So, a rule of thumb is, check for patches even for clean installations, and just because it worked in the last version, does not mean it will always work going forward. Plan to test the “special cases” yourself.
How many staging environments do I need?
We have outlined the ideal situation above, however this may not be practical to fund or manage for small organisations. For very small organisations there may be a strong argument that no staging environments are necessary, and all risk is borne by the organisation for “rolling out” updates automatically. Even though single-stage strategies (or no strategy at all) occur more frequently than they should, service providers can offer a degree of safety by maintaining a customer-aligned staging environment in their own development lab. Again, this measure of testing covers core software functionality, not the server infrastructure or business system interfaces.
Our recommendation is you need to stage all software, no exceptions. If your software support partner doesn’t have a lab which can mirror your specific software environment, you need to find a partner that can.
What is configuration management and why is it important?
Putting aside operating system changes and application updates, the primary focus of this article so far, the most vulnerable area, and the one most often overlooked and most deserving of particular attention, is the management of configuration: how a system is uniquely set up for use, and how the previous major changes, and all the minor changes in between, are tracked, applied, and stored for rollback and forensic purposes. We have saved the best and most controversial topic for last.
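Tracking, applying and storing changes for rollback need not be elaborate to be useful. The sketch below shows the idea with an in-memory history keyed by a content hash; the storage scheme and setting names are assumptions for illustration, and a real system would persist versions to controlled, backed-up storage.

```python
import hashlib
import json
import time

# Minimal sketch of a tracked configuration history (storage scheme is an assumption).
class ConfigHistory:
    def __init__(self):
        self.versions = []  # list of (hash, timestamp, config) for rollback/forensics

    def record(self, config: dict) -> str:
        blob = json.dumps(config, sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()
        # Only store a new version when something actually changed.
        if not self.versions or self.versions[-1][0] != digest:
            self.versions.append((digest, time.time(), config))
        return digest

    def rollback(self) -> dict:
        """Discard the latest change and return the previous configuration."""
        if len(self.versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.versions.pop()
        return self.versions[-1][2]

history = ConfigHistory()
history.record({"poll_rate_ms": 500})
history.record({"poll_rate_ms": 250})  # a minor change, tracked like any other
print(history.rollback())              # restores {"poll_rate_ms": 500}
```

The hash gives every minor change a stable identity, which is exactly what forensic review needs when someone asks “what changed, and when?”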
When a software platform has been staged and there is a high level of confidence (it is not generating exception logs, interface connect errors, etc.), the application itself still needs to be proven as functional. Why is this important? Application developers are usually (or should be) certified partners of the operating system vendors. This means they are already doing development against pre-release operating system software. This is complex to consider because the pre-release software is also constantly changing, so it is still a “work in progress”. Though good measures are taken to align application development with the OS release, applications are always playing catch-up, and an OS patch may already be available by the time the application software is released.
For industrial automation software providers, the criticality of the environment is heightened. It simply must work. You cannot throw a software release out overnight and “fix it” according to customer feedback. This is the documented strategy for many App providers. This is simply not acceptable for operational environments, though the sentiment can subtly leak into organisations that have soft configuration management practice.
Usually there is a significant delay between when the operating system release is made available and when the automation software release is officially supported on the new iteration of the operating system. Careful reading of automation software release notes reveals what is formally supported. Moving away from the supported OS version is a risk and is likely to be in breach of the end user license agreement. The major challenge is, what do I do if I really need to load the latest OS patch for security reasons and the application vendor will not confirm it’s a goer? Unfortunately, configuration management responsibility (application or hosting software) will always remain with the user of the software. There are no shortcuts to managing your unique configuration, after all you configured it just the way you like it, right? The onus is on the software platform manager to decide to wait or go ahead with caution.
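The decision point above, is this OS build formally supported or not, is worth making explicit rather than leaving to memory. Here is a small sketch of checking against a support matrix transcribed from release notes; the product and OS version strings are made up for the example.

```python
# Illustrative support-matrix check; product names and versions are fictional
# placeholders transcribed, in a real system, from the vendor's release notes.
SUPPORTED_OS = {
    "AutomationSuite 9.2": {"Windows Server 2019", "Windows Server 2022"},
    "AutomationSuite 9.1": {"Windows Server 2016", "Windows Server 2019"},
}

def os_is_supported(app_release: str, os_version: str) -> bool:
    """Return True only if the vendor formally supports this OS version."""
    return os_version in SUPPORTED_OS.get(app_release, set())

print(os_is_supported("AutomationSuite 9.2", "Windows Server 2022"))  # True
print(os_is_supported("AutomationSuite 9.1", "Windows Server 2022"))  # False: wait, or proceed with caution
```

A False here does not decide for you; it simply forces the wait-or-proceed judgement to be made consciously, with the licence and support implications on the table.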
High level recommendations to avoid disaster
There are no explicit rules that fit every system. We have outlined just a few of the important factors to weigh when considering updates to industrial automation software and interconnected systems. The most important thing to remember is that you cannot take a narrow view of how to proceed. A systems view is essential.
Given we cannot give you a silver bullet, what can we suggest?
- Audit your systems. Assess your acceptable risk profile and system maturity, plan holistically, and stick to the plan.
- Formally check your systems regularly, looking for changes in performance. Your system requires maintenance not because your software is broken, but because your external systems (and configuration) are constantly changing.
- Don’t automatically load any software. Manage it, step by step. Assess the risk.
- Stage software roll outs for the operating system. Reassess the risk.
- Stage software roll outs for application software. Reassess the risk.
- Stage software roll outs for configuration changes. Reassess the risk.
- Document all changes, backup all changes, and plan for roll back every single time.
- Check all critical functions operate after every update, particularly major releases. Make no assumptions, even for the basics.
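That last point, checking critical functions after every update, lends itself to a scripted smoke test. The sketch below is illustrative only: the check names are placeholders, and each lambda stands in for a real probe (a connectivity test, an alarm round-trip, a report export).

```python
# Illustrative post-update smoke test; the check names are hypothetical placeholders.
def run_smoke_tests(checks: dict) -> list[str]:
    """Run every critical-function check and return the names of any that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure, not a pass
        if not ok:
            failures.append(name)
    return failures

checks = {
    "historian_reachable": lambda: True,   # stand-in for a real connectivity probe
    "alarm_pipeline_alive": lambda: True,
    "report_export": lambda: False,        # simulate a regression after the update
}
failed = run_smoke_tests(checks)
print(failed)  # ['report_export']
```

Running the same scripted checks after every update, rather than ad hoc clicking around, is what turns “no assumptions” from a slogan into a repeatable step.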
This approach has been formulated based on years of managing critical, and not so critical, software systems. Did you notice how many times we mentioned the word risk? This is not to make anyone concerned or nervous about how to proceed. Risk is the chance or probability that an unwanted outcome may occur. We believe with industrial automation systems the risks are preventable or at the very least manageable, if there is a propensity to face them head on. While on the topic of risk you may want to take a look at our take on Critical Infrastructure Cyber Security Solutions.
With good practice we can all move closer to 100% certainty.