This is a repost from an older blog. For some reason, it ended up getting the most traffic of anything I wrote there. The post is pretty specific, and I think most of the traffic was academic. Anyway, here it is. Let me know what you think.
Business Continuity Planning for Industrial Control Systems
Anyone involved in information security, or simply enthusiastic about the technology, has probably heard of Stuxnet. Discovered in 2010, Stuxnet was a cyber attack on Iran’s nuclear program and is generally acknowledged to be the world’s first cyber weapon that caused physical damage (Zetter, 2014). A lesser-known attack, possibly the second cyber attack to cause physical damage, came to light in Germany in late 2014. A report released just before Christmas described how hackers attacked the control systems of a German steel mill and disrupted them enough that a blast furnace could not be shut down properly, which caused massive damage (Zetter, 2015). This attack is significant because it illustrates the enormous responsibilities that companies take on when they connect their control networks to the internet.
Industrial Control Systems (ICS) is a general term for components that connect the Information Technology (IT) world to the Operational Technology (OT) world. Today, everyone knows what IT is, but OT is a little less familiar. OT is the hardware and software that monitors and controls physical equipment and processes, and ICS components connect computers and servers to those devices. An example of an ICS environment is the oil and gas refinery industry. Refineries are made up of a significant number of pipes, tubes, and boilers that heat, filter, and extract crude oil and gas until it is refined enough to be used in the chemicals and gasoline that we use in everyday life. Fifty or sixty years ago, before computers and mainframes became popular, refineries were run by employees who would manually turn dials or manipulate the control systems until the desired outcome was achieved. When mainframe computers became popular, these systems started being connected to internal networks. Supervisory Control and Data Acquisition (SCADA) systems were added to monitor and supervise the control processors connected to them (Assante, 2014). Fast forward a few more years to when personal computers and the internet became readily available, and companies started to want their systems to be connected to, and accessible from, the internet (Assante, 2014).
Just like any technology today, control systems have always had design problems and vulnerabilities. What changed when these systems were connected to the internet was that those vulnerabilities were now exposed to the world, along with all of the common vulnerabilities that come with being connected to the internet. ICS security is hard, and it is hard because control systems engineers and computer security engineers have competing priorities. Control system engineers focus on safety and uptime, and anything that gets in the way of those is a problem for them. When security professionals want to change things and bolt on security software and appliances that the ICS units were never designed for, there are problems (Antova, 2017). In the years since the 2010 Stuxnet attack, the industries that depend on industrial control systems have taken notice and are starting to make their networks more secure.
In today’s high-demand economy, manufacturers, refineries, and other critical infrastructure industries run 24 hours a day, seven days a week. These systems require more than the 99% uptime that most companies strive for. These companies require 99.9999% uptime. While that difference might seem insignificant, over the course of a year it is the difference between 3.65 days of downtime and roughly 32 seconds. Less than four days of downtime over the course of a year would not be too bad for most businesses, but imagine what would happen at an airport if the air traffic control tower, which controls 1,000 or more flights per day, were down for that long. This example shows just how vital it is for companies that need incredibly high availability to have a robust Business Continuity Plan (BCP) and Disaster Recovery Plan (DRP). Every hour of disruption to their normal business functions could cost hundreds of thousands of dollars or more. For the purpose of this paper, the scope will focus on oil and gas refineries in the context of BC/DR planning.
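To make the uptime arithmetic concrete, here is a minimal Python sketch that converts an uptime percentage into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration, not part of any standard.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def annual_downtime_seconds(uptime_percent: float) -> float:
    """Return the downtime per year, in seconds, that an uptime target allows."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * SECONDS_PER_YEAR


# Compare a few common availability targets.
for target in (99.0, 99.9, 99.99, 99.9999):
    seconds = annual_downtime_seconds(target)
    print(f"{target}% uptime allows {seconds / 86400:.4f} days "
          f"({seconds:,.1f} seconds) of downtime per year")
```

At 99% the allowance works out to 3.65 days, and at 99.9999% it is about 31.5 seconds, which is why every extra nine is so expensive to achieve.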
One of the first steps needed before starting to plan the BCP or the DRP is to conduct a Business Impact Analysis (BIA). The BIA should identify all of the primary business functions and list any interruptions that could affect those operations, anything from supply chain issues and power outages to terrorist attacks or sabotage. One difference between ICS environments and a typical business environment is that when things go wrong in a typical business, it will likely only affect that business. If a critical operational function is affected in an ICS environment, not only is the operational equipment affected, but depending on what that equipment is doing, there could be environmental and public safety issues to consider. Refineries often have to inform local government and give press releases describing what they are doing, why they are doing it, and how it affects the residents.
During the BIA process, the company needs to evaluate two key figures. The first is the Recovery Time Objective (RTO). The RTO defines the time required to recover the communication links and processing capabilities (Stouffer, 2015). If the network goes down while a refinery is running, the refinery will not suddenly stop with it. It will keep running, but things will slowly start degrading from a lack of communication and control. On a long enough timeline, there could be catastrophic consequences, so the RTO is critical to pay attention to and will be used later to trigger specific events in the BCP/DRP. The second key figure is the Recovery Point Objective (RPO). The RPO defines the longest period of time for which an absence of data can be tolerated before adverse conditions start occurring (Stouffer, 2015).
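As a rough illustration of how these two figures could trigger events in a plan, the sketch below compares an ongoing outage against an RTO and an RPO. It treats the RPO as a tolerable gap in process data, which is how NIST SP 800-82 describes it. The class, the function, and the four-hour and fifteen-minute values are hypothetical and only there to show the idea.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # time allowed to recover communication links and processing
    rpo: timedelta  # longest tolerable absence of process data


def breached_objectives(objectives: RecoveryObjectives,
                        outage: timedelta,
                        data_gap: timedelta) -> list[str]:
    """Return the names of the objectives that the current incident has exceeded."""
    breached = []
    if outage > objectives.rto:
        breached.append("RTO")
    if data_gap > objectives.rpo:
        breached.append("RPO")
    return breached


# Hypothetical objectives for a single refinery unit.
unit_objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))

# Five hours into an outage, with ten minutes of missing process data.
print(breached_objectives(unit_objectives, timedelta(hours=5), timedelta(minutes=10)))
```

In a real plan, each breached objective would map to a documented escalation step rather than a printed list.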
Once the BIA has been completed and the RTO and RPO are defined, the BCP planning can begin. The BCP should take into consideration any interruption that affects operations, in whatever form it may come. It could be natural disasters, human mistakes, equipment failures, or even terrorist events. Each event needs to be evaluated by a team that has a member from each discipline involved. There may be a control system engineer for the ICS equipment, an IT engineer for the network equipment, a chemical or petroleum engineer for the oil and chemicals, and possibly even a safety engineer or public relations specialist. All of these team members should report to the BCP manager, who reports to an information security manager or BCP executive. Because of the nature of oil refineries, the legal department and compliance officers will often be involved, because the refineries will have to release public information and inform the local, state, and federal governments when required.
BCPs can and should contain smaller plans based on each worksite or business function. In the case of ICS networks, the plan will also be broken down by unit or system. These systems work together to complete a particular function, and there are very specific steps that these units must go through when powering down or powering up. It can often take hours or days to entirely shut down operations on a refinery unit.
In many cases, such as during inclement weather like hurricanes, refineries will not be able to maintain normal business functions, but the goal will be to have the least possible impact on the business. To achieve that goal, the refinery units must be shut down and powered up correctly so that there is no damage to the unit equipment. That also requires the companies to work with meteorologists so that they can get the most accurate prediction possible and properly time when they need to shut down operations.
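The timing problem can be sketched with simple date arithmetic: given a forecast arrival time and how long each unit takes to shut down safely, work backward to the latest moment each shutdown can start. The unit names, durations, forecast time, and safety margin below are hypothetical, and the sketch assumes the units can be shut down in parallel, which a real refinery may not allow.

```python
from datetime import datetime, timedelta


def latest_shutdown_starts(forecast_arrival: datetime,
                           shutdown_durations: dict[str, timedelta],
                           safety_margin: timedelta) -> dict[str, datetime]:
    """Return the latest start time for each unit's shutdown procedure."""
    # Everything must be safely down this long before the storm arrives.
    deadline = forecast_arrival - safety_margin
    return {unit: deadline - duration for unit, duration in shutdown_durations.items()}


# Hypothetical forecast and unit shutdown times.
landfall = datetime(2024, 9, 10, 18, 0)
durations = {
    "crude_distillation": timedelta(hours=36),
    "hydrotreater": timedelta(hours=20),
}

starts = latest_shutdown_starts(landfall, durations, safety_margin=timedelta(hours=12))
for unit, start in sorted(starts.items(), key=lambda item: item[1]):
    print(f"{unit}: begin shutdown no later than {start:%Y-%m-%d %H:%M}")
```

The better the forecast, the smaller the safety margin can be, which is exactly why working with meteorologists matters.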
Another subsection of the BCP is called the Continuity of Operations Plan (COOP). COOPs primarily pertain to the industries and organizations identified in Presidential Policy Directive 21 (PPD-21). That policy directive identifies 16 critical infrastructure sectors that are vital to the continued operation of the country and the federal government. Some of these sectors are the energy sector, chemical sector, communications sector, dams sector, defense industrial base sector, and the nuclear reactors, materials, and waste sector. The country relies on these industries every day. The National Infrastructure Protection Plan outlines how the federal government and private sector will work together to manage risk and become more secure and resilient (NIPP, 2013). The significant difference between a BCP and a COOP is that a BCP primarily applies to the private sector and businesses, whereas a COOP is for government organizations and designated private sectors (Swanson, 2010). COOPs involve more government oversight than BCPs.
The resources and information available to IT security professionals are diverse. A well-known source for guides and best practices is the National Institute of Standards and Technology (NIST). Its Special Publication 800 series focuses on cyber and information security. The three primary guides that will be used to help develop BCPs or COOPs are NIST SP 800-53, Security and Privacy Controls for Federal Information Systems and Organizations; NIST SP 800-82, Guide to Industrial Control Systems (ICS) Security; and NIST SP 800-34, Contingency Planning Guide for Federal Information Systems. All three guides are excellent for helping to develop business continuity and disaster recovery plans.
The Industrial Control Systems Cyber Emergency Response Team (ICS-CERT) is another excellent resource for ICS. ICS-CERT provides resources and assessments to help identify vulnerabilities in ICS networks. The organization also releases alerts and advisories that work similarly to the Common Vulnerabilities and Exposures (CVE) list and NIST’s National Vulnerability Database (NVD). Now we will look at a case study of a cyber attack on the electrical grid in Ukraine.
Just before Christmas, on December 23, 2015, a Ukrainian electrical company reported power outages to its customers. Over 30 electrical substations went offline for three hours and left over 200,000 customers without power. The investigation that followed determined that someone had entered the network and taken the three largest substations down one after the other, about 30 minutes apart. After the investigation, the Ukrainian government reported it as a cyber attack. The Department of Homeland Security (DHS) acknowledged it and issued a formal report on February 25, 2016, listing the event as IR-ALERT-H-16-056-01 (ICS-CERT, 2016). The power outage only lasted about three hours, but it took a month for the control systems to come 100% back online. This was a historic event because it was the first known cyber attack to cause a loss of power (Bodungen, 2017). The attack came from malware that was planted by hackers. The malware is known as BlackEnergy and is classified as known crimeware. Two files called Devlist.cim and Config.bak were installed on the operating systems that run the SCADA software. Those two files are known to kill critical parts of the operating system. Once the operating systems were compromised, the ICS devices locked up and the SCADA software stopped working, causing the blackout (Bodungen, 2017). The vector that the hackers took in gaining access to the systems is not known, at least not publicly. However, there is an almost unlimited number of ways that hackers could enter a system if the computers and other ICS devices are not updated.
While there is not much public information about this attack, there are two things to consider from a BCP perspective. The first is the initial incident. A power company losing power is bad for the company and its customers. If the power company has to compete for customers because there are several companies to choose from, this could have a major impact on its bottom line.
The next issue is related to the architecture of the network and how it is segmented. It is well known in the IT world that if hackers can gain access to one computer or admin account, they can move laterally through the network and start chipping away at the parts of the network that they do not yet have access to. Eventually, if the hackers work long enough, they can compromise the entire system.
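To illustrate why segmentation matters, the sketch below models a network as a map of which hosts can reach which others, then uses a breadth-first search to find everything an attacker could eventually reach from one compromised machine. The host names and both network layouts are hypothetical, and real lateral movement depends on credentials and vulnerabilities, not just connectivity.

```python
from collections import deque


def reachable_hosts(links: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search: every host reachable from `start` over allowed links."""
    seen = {start}
    queue = deque([start])
    while queue:
        host = queue.popleft()
        for neighbor in links.get(host, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical flat network: the office PC has a path into the control network.
flat = {
    "office_pc": ["file_server", "historian"],
    "file_server": ["historian"],
    "historian": ["scada_server"],
    "scada_server": ["plc_1"],
}

# Hypothetical segmented network: no link from the IT side to the historian.
segmented = {
    "office_pc": ["file_server"],
    "file_server": [],
    "historian": ["scada_server"],
    "scada_server": ["plc_1"],
}

print("flat:     ", sorted(reachable_hosts(flat, "office_pc")))
print("segmented:", sorted(reachable_hosts(segmented, "office_pc")))
```

In the flat layout, the compromised office PC can eventually reach the PLC. In the segmented layout, the damage stops at the file server.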
Some of the significant data breaches of the last few years have shown that hackers had access to the networks for months before they started causing damage or were discovered. In the case of ICS networks, it works the same way but with longer-lasting damage. What usually happens when a hacker attempts to bring down an ICS network is that they try to make a change to the actual control systems so that they can damage or destroy the equipment. Sometimes the outcome is immediate, and sometimes it can take days or weeks for the damage to become significant enough to be noticeable. If hackers have been on an ICS network for some time, they could have already caused lasting damage to the physical equipment before they were discovered or caused noticeable damage. So, in the aftermath of an attack on an ICS network, the damage could be much more significant than first thought. The entire network segment will have to be assumed compromised and all of its equipment inspected. Even if only a single pump were damaged, every pump on that network would have to be inspected for damage and potentially replaced, even if the pumps did not show a catastrophic failure.
Since the Stuxnet attack, the field of ICS security has grown. Unfortunately, the attacks will probably only get bigger. The unique thing about ICS network attacks is that the adversaries are very knowledgeable and know what they are doing. You will most likely not find a script kiddie attacking an ICS network. The types of hackers that would attack ICS networks are state-sponsored hackers or insiders with deep knowledge of the networks. These are the types of Advanced Persistent Threats (APT) that most security professionals worry about.
The second unique thing about ICS networks is that when damage is done to the control systems, there is physical damage. Physical damage does not go away like the damage that can be done to computers and their data. There is no way to back up control systems and manufacturing components. When the damage is done, those units will likely have to be replaced immediately or well before the end of their life expectancy.
It is said that ICS network security is about a decade behind IT security (Antova, 2017). The Morris Worm, which occurred in 1988, is considered the first malware to cause interruptions on the internet (Kehoe, n.d.). The Morris Worm was a watershed moment in the history of IT security. Stuxnet only happened eight years ago, but it will probably be seen later as a watershed moment for ICS network security.
Resources
Zetter, Kim. (2014). Countdown To Zero Day. Broadway Books: New York
Zetter, Kim. (January 2015). A Cyberattack Has Caused Confirmed Physical Damage For the Second Time Ever. Retrieved from https://www.wired.com/2015/01/german-steel-mill-hack-destruction/
Assante, Michael. Conway, Tim. (August 2014). An Abbreviated History of Automation & Industrial Controls Systems and Cybersecurity. Retrieved from https://ics.sans.org/media/An-Abbreviated-History-of-Automation-and-ICS-Cybersecurity.pdf
Lee, Robert. Assante, Michael. Conway, Tim. (March 2016). TLP: White. Analysis of the Cyber Attack on the Ukrainian Power Grid. Retrieved from https://ics.sans.org/media/E-ISAC_SANS_Ukraine_DUC_5.pdf
ICS-CERT. (February 2016). Alert (IR-ALERT-H-16-056-01): Cyber-Attack Against Ukrainian Critical Infrastructure. Retrieved from https://ics-cert.us-cert.gov/alerts/IR-ALERT-H-16-056-01
Bodungen, Clint. Singer, Bryan. Shbeeb, Aaron. Hilt, Stephen. et al. (2017). Hacking Exposed: ICS and SCADA Security Secrets & Solutions. McGraw-Hill Education: New York
Stouffer, Keith. Pillitteri, Victoria. et al. (May 2015). Guide to Industrial Control Systems (ICS) Security. Retrieved from http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-82r2.pdf
NIPP. (2013). NIPP 2013: Partnering for Critical Infrastructure Security and Resilience. Retrieved from https://www.dhs.gov/sites/default/files/publications/national-infrastructure-protection-plan-2013-508.pdf
Joint Task Force Transformation Initiative. (April 2013). Security and Privacy Controls for Federal Information Systems and Organizations. Retrieved from http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-53r4.pdf
Cruz, Tiago. Simoes, Paulo. et al. (July 2016). Security implications of SCADA ICS virtualization: survey and future trends. Retrieved from https://www.researchgate.net/publication/305725280_Security_implications_of_SCADA_ICS_virtualization_survey_and_future_trends
Swanson, Marianne. Bowen, Pauline. et al. (May 2010). NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems. Retrieved from http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-34r1.pdf
Antova, Galina. (August 2017). Overcoming the Lost Decade of Information Security in ICS Networks. Retrieved from http://www.securityweek.com/overcoming-lost-decade-information-security-ics-networks
Kehoe, Brendan. (n.d.). The Robert Morris Internet Worm. Retrieved from http://groups.csail.mit.edu/mac/classes/6.805/articles/morris-worm.html