Business Continuity
Surviving Disaster
Felix Mohan, CEO - SecureSynergy
Posted on 17 May 2003
Introduction
In a world of real time transactions and just-in-time inventory,
losing a phone system may wreak more havoc than a fire in
the building. Every business faces minor downtimes, and major
unknowns; hence it is important to have plans in place that
guarantee business continuity.
Before the September 11 attacks, Business Continuity Planning
(BCP) was often regarded as an expense that brought no return
on investment. Events like 9/11 serve as constant reminders
that it is vital for every company to have plans in place
to ensure its own continuity and that of its suppliers
and logistics. BCP costs relatively little compared with
what a company could lose in a major incident.
It is therefore prudent for organizations of all
sizes to research and develop a plausible and efficient
contingency plan.
Business continuity and disaster recovery planning are now
accepted as basic requirements for every business organisation.
It is widely accepted that a detailed Disaster Recovery Plan
should not only exist, but should be up to date. It should
reflect the actual on-going needs of the business activity
or function.
In recent years, data has become a critical corporate asset
essential to business continuity. The ability to recover crucial
data quickly after a disaster is a fundamental requirement
of economic viability.
With good plans in place, frequently tested, and well audited,
there is every possibility that an organisation will cope
with adverse events and continue in business, satisfying its
customers, meeting its commitments and making a return on
investment.
Contingency planning is considered a part of an organization's
risk assessment and other security programs. The team responsible
for contingency planning must be aware of risks to the system
and recognize whether the current contingency plan is able
to address residual risks completely and effectively.
IT contingency planning represents a broad
scope of activities designed to sustain and recover critical
IT services following an emergency. Contingency planning fits
into a much broader emergency preparedness environment that
includes organisational and business process continuity and
recovery planning.
An organisation would select a plan to properly prepare for
response, recovery, and continuity in case of disruptions
affecting the organization's IT system, business process,
and the facility. Because there is an inherent relationship
between an IT system and the business process it supports,
there should be coordination between the plans during development
and updating to ensure that the strategies and supporting
resources are in line with the contingency objective.
Crisis Management Plan
A Crisis Management Plan is designed to maximize human survival
and the preservation of property, minimize danger, restore normal
operations, and assure responsive communications. Its
procedures should be coordinated with all other plans
and can become an appendix to the Business Continuity Plan.
Cyber Incident Response Plan
The Cyber Incident Response Plan establishes the procedures
to address cyber attacks against an organization's IT system.
This plan details the procedures that enable security personnel
to identify, mitigate, and recover from malicious security
incidents, such as unauthorized access to a system or data,
denial of service, or unauthorized changes to system hardware,
software or data (for example, by malicious logic such as a
virus, worm or Trojan horse).
Disaster Recovery Plan
Disaster Recovery Planning involves developing a plan and
preparing for a disaster before it takes place in the hopes
of minimizing loss and ensuring the availability of critical
systems and personnel. It consists of a set of activities
aimed at reducing the likelihood and limiting the impact
of disaster events on critical business processes.
Other objectives of disaster recovery planning include:
- Providing a sense of security
- Minimizing risk of delays
- Guaranteeing the reliability of standby systems
- Providing a standard for testing the plan
- Minimizing decision-making during a disaster
More than off-site storage or backup processing,
organizations should develop written, comprehensive disaster
recovery plans that address all the critical operations and
functions of the business. The plan should include documented
and tested procedures, which, if followed, will ensure the
ongoing availability of critical resources and continuity
of operations.
A disaster plan is similar to liability insurance:
it provides a certain level of comfort in knowing that if
a major catastrophe occurs, it will not result in financial
disaster. Insurance alone is not adequate, however, because
it may not compensate for the incalculable loss of business
during the interruption, or for business that never returns.
Elements of a Successful Disaster Recovery Plan Program
Information collection
The business processes of a company need to be identified,
together with the components of IT infrastructure used to
support each process. IT infrastructure components may include
application software, servers, operating systems, data and
storage systems, local and wide area networks and client systems
including PCs and peripherals.
This process also measures the impact of unplanned interruption
on each business process and IT infrastructure.
Risk analysis
The basic steps of risk analysis are outlined below.
The ability of a company to cope with the
interruption of a business process determines the tolerance
of that process. In practical terms, tolerance may
be expressed as a rupee value: the cost to the company if
the business process is interrupted for a given period of time.
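As a rough illustration of expressing tolerance as a rupee value, the sketch below multiplies an assumed hourly loss by the length of the outage and ranks processes by the result. The process names and figures are hypothetical and are not drawn from any real assessment.

```python
# Hypothetical hourly loss (in rupees) for each business process.
HOURLY_LOSS_RUPEES = {
    "order_entry": 250_000,
    "payroll": 20_000,
}


def outage_cost(process: str, hours: float) -> float:
    """Estimated cost of interrupting `process` for `hours` hours.

    Assumes a constant loss per hour, which is a simplification:
    real losses often grow faster the longer an outage lasts.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return HOURLY_LOSS_RUPEES[process] * hours


def rank_by_tolerance(hours: float) -> list:
    """Processes ordered from least tolerant (costliest outage) to most."""
    return sorted(
        HOURLY_LOSS_RUPEES,
        key=lambda name: outage_cost(name, hours),
        reverse=True,
    )
```

With these assumed figures, a four-hour interruption of order entry would cost 1,000,000 rupees, so that process would be treated as the least tolerant of the two.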
Critical applications are defined as such because, regardless
of the duration of the outage or the time of month in which an
outage occurs, there are no substitute methods for providing
the functions of the application. Electronic commerce applications
used by on-line brokers, for example, are clearly mission critical.
Critical resources with no substitute are described as Single
Points of Failure.
Within any complex system, there are usually components or
processes that, if not replicated or otherwise backed up by
redundant capabilities, represent points of failure for the
entire system. A large part of disaster avoidance planning
comes down to identifying Single Points of Failure, wherever
they exist, and eliminating them.
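One way to look for single points of failure is to record, for each business process, which components can provide each function it depends on, and then flag any function with only one provider. The following is a minimal sketch under that assumption; the inventory and all component names are hypothetical.

```python
# Hypothetical inventory: for each function a business process needs,
# the components able to provide it. All names are illustrative.
PROCESS_DEPENDENCIES = {
    "online_trading": {
        "database": ["db-primary", "db-standby"],
        "wan_link": ["carrier-a"],
        "app_server": ["app-1", "app-2"],
    },
}


def single_points_of_failure(dependencies):
    """Return (process, function, component) for every function that
    only one component can provide, i.e. one with no redundancy."""
    found = []
    for process, functions in dependencies.items():
        for function, providers in functions.items():
            if len(set(providers)) == 1:
                found.append((process, function, providers[0]))
    return found
```

In this example only the WAN link is reported, because the database and application servers each have a second provider.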
Identify risks by phenomenon, including:
- Water damage (whether from leaky pipes or floods)
- Fire (heat) damage (whether from arson, equipment overheating,
  environmental contamination, lightning strikes, etc.)
- Power failure (originating at the customer premises or across
  the power grid)
- Network failure (LAN or WAN, whether component or link based)
- Mechanical hardware failure or software failure (whether due to
  human error, short circuits, normal parts wear and tear, or
  building collapse following an earthquake)
- Accidental or deliberate destruction or corruption of hardware,
  software, or data (by hackers, disgruntled employees, industry
  saboteurs, terrorists, or misbehaving software)
- Other causes (including forced evacuation for environmental
  hazards, aircraft crashes, etc.)
Disaster Avoidance System
The purpose of disaster avoidance systems is to provide an
automated mechanism for detecting certain disaster potentials
(and to respond to them, where possible) before they develop
into unplanned interruptions of normal business processes.
These include:
- Systems for water detection that can provide early warning of
  leaks and water-related hazards
- Systems for the detection of pre-ignition gases, smoke and other
  indicators of impending fire, to enable a proactive response that
  will ensure the health and safety of personnel and prevent the
  loss of data and equipment to fire
- Systems for the detection of airborne contamination levels that
  are associated with employee illness, data loss, equipment
  malfunction, and fires
- Systems for the suppression of fires
- Systems for the continuation of electrical power in the presence
  of a utility power outage
- Systems for the physical security of corporate computing and
  telecommunication facilities
Water Detection
Water can intrude into sensitive information processing and
storage facilities as well as user work areas, in a variety
of ways and from a variety of sources. Some common sources
of flooding are:
- Facility plumbing leaks
- Air conditioning
- Water cooling systems
- Sprinkler systems
Detection systems, ranging from simple battery-operated
alarms to sophisticated sensing cables and ceiling
grids, are available to detect the presence of water wherever
it is found and either sound an audible alarm or relay a hazard
alert message to a system or network management console.
Fire Suppression
Fire prevention begins with facility design and construction.
Fire-resistant construction materials, firewall placement
and facility compartmentalization can play major roles in
limiting the scope, duration and destructiveness of a fire.
Contamination Detection
The entry of airborne contaminant particles into electrical
equipment can cause short circuits and even flash fires in
electronic equipment.
White-glove method: This method assesses the level
of contamination by wiping the exposed surface of a piece
of equipment with a white glove. The particulate on the glove
is analysed to determine the type of contaminant in the data
center environment.
Aspirating pump: This method involves the installation
of a pump. Air samples are collected through an air intake,
and the contents are analysed.
Precombustion Detection
Heat and smoke detectors are available in a variety of types,
shapes, and sizes to alert personnel to hazardous conditions.
They include:
- Photoelectric detectors, which detect the smoke produced by
  smoldering fires, such as those that involve PVC insulation
- Ionization detectors, which detect fires involving more flame
  than smoke
- Temperature detectors, which detect heat in excess of a
  preset value
- Rate-of-rise heat detectors, which detect temperature rising
  faster than a preset threshold (useful in environments subject
  to significant ambient temperature changes, such as nuclear
  power generation facilities or heavy manufacturing environments)
- Air sampling detectors, which detect the invisible by-products
  of materials as they degrade during the pre-combustion stages
  of fire
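The difference between a temperature detector and a rate-of-rise detector can be sketched in a few lines. This is only an illustration of the two triggering rules; the function names, units and thresholds are assumptions, and real detectors implement this logic in hardware or vendor firmware.

```python
def fixed_temperature_alarm(reading_c: float, preset_c: float) -> bool:
    """Temperature detector: alarm when heat exceeds a preset value."""
    return reading_c > preset_c


def rate_of_rise_alarm(readings_c, interval_s: float,
                       max_rise_c_per_min: float) -> bool:
    """Rate-of-rise detector: alarm when the temperature climbs faster
    than a preset rate between any two consecutive samples.

    `readings_c` holds samples taken every `interval_s` seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    for earlier, later in zip(readings_c, readings_c[1:]):
        rise_per_min = (later - earlier) * 60.0 / interval_s
        if rise_per_min > max_rise_c_per_min:
            return True
    return False
```

A room that is hot but stable trips only the fixed-temperature rule, while a cool room that is heating quickly trips only the rate-of-rise rule, which is why the latter suits environments with large ambient temperature swings.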
Power Failure
Interruptions in electrical power can result from a variety
of factors, such as:
- Transformer failure or line damage
- Natural disasters and damage from severe weather
- Utility company outages
- Inadequate power-handling capacity in multi- or single-tenant
  buildings
- Sabotage and terrorism
Providing alternatives or backups for the
facility power supply is one method to insulate the company
against external conditions that are beyond its ability to
control.
Day-to-day power-related problems should also be addressed, such
as line dips and surges, transverse- and common-mode interference
or noise and, in some areas, brownouts.
An Uninterruptible Power Supply (UPS) affords protection against
both momentary and prolonged outages.
UPS vendors also add intelligence to their products, such as:
- Simple Network Management Protocol (SNMP) support, enabling the
  transmission of information about UPS status to system or
  network management software
- Event logging, providing the means to store information about
  power events to facilitate the troubleshooting of power-related
  problems
- Temperature monitoring, capturing temperature information on UPS
  components as well as power outlets and signaling operations if
  preset thresholds are exceeded
Other alternatives include:
- On-site power generation: Vendors offer self-generation packages,
  sized to meet specific load needs, that are operated and
  maintained by the provider
- Deliverable power generation: For companies that prefer not to go
  into power generation themselves, a second class of portable
  power providers sells UPS systems and wiring services, together
  with a contract to appear on site with a portable generator (or
  to deliver power from a separate generating facility) in an
  emergency or during periods when utility outages are expected
- Electrical power loss insurance: In certain countries, an
  organisation can take out an insurance policy that compensates a
  business customer for losses incurred during a power outage
Data Recovery Planning
A successful business recovery comes down to a simple axiom:
Shorten the time to data
For the company experiencing an unplanned interruption in
time-sensitive, mission-critical business processes, the primary
objective is to establish access to application data quickly
and by whatever means possible.
Time to data is a determinant of post-disaster business survival.
Creating strategies to shorten time to data is the primary
mission of disaster recovery planning.
Once provisions have been made to minimize the likelihood
of avoidable disasters, attention turns to developing strategies
to restore infrastructure supports for critical business processes
in the wake of disasters that cannot be effectively avoided.
Restoration speed of data to a usable form is determined by
the sensitivity of the company to the duration of an unplanned
interruption. To address different degrees of sensitivity,
several techniques of data restoration have evolved over time.
These include:
- Routine data backup to magnetic tape using backup/restore
  software and the removal of backups to off-site storage. It
  requires the retrieval of stored backup tapes, transport to the
  system recovery facility, and restoration of data to a new
  storage platform via software.
- Routine data backup to an electronic "tape vault" via a wide area
  network interconnect or the internet. Restoration may require the
  physical retrieval of tapes and their transport to a system
  recovery site, or it may be possible using a WAN link between
  the recovery system and the electronic tape vault.
- Remote mirroring of data to a second (or third) storage platform
  via WAN interconnect. Restoration is unnecessary. The recovery
  system is connected via WAN links to the remote mirror array, or
  the remote mirror may be located at the system recovery site.
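To compare the three techniques above on time to data, a planner might add up the steps each one requires. The sketch below is a back-of-the-envelope model only: the function names are hypothetical, and every duration and throughput is an assumed input that would have to be measured for a real site.

```python
def _transfer_hours(data_gb: float, gb_per_hour: float) -> float:
    """Hours needed to move `data_gb` at a given throughput."""
    if gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return data_gb / gb_per_hour


def tape_time_to_data(retrieval_h, transport_h, data_gb, restore_gb_per_h):
    """Off-site tape: retrieve the tapes, transport them to the
    recovery facility, then restore the data via software."""
    return retrieval_h + transport_h + _transfer_hours(data_gb, restore_gb_per_h)


def vault_time_to_data(data_gb, wan_gb_per_h):
    """Electronic tape vault restored over a WAN link, with no
    physical transport of tapes."""
    return _transfer_hours(data_gb, wan_gb_per_h)


def mirror_time_to_data(connect_h):
    """Remote mirror: no restoration step, only the time needed to
    connect the recovery system to the mirror."""
    return connect_h
```

For example, with 500 GB of data, two hours of retrieval, three hours of transport and a restore rate of 100 GB per hour, the tape option comes to ten hours, whereas a mirror is limited only by how quickly the recovery system can be connected.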
System Recovery
There is interdependence between the centralized system backup
strategies and the data protection strategies, along with
other disaster recovery plan elements.
Once application criticality is defined, the risk analysis
goes further to identify the hardware (both CPUs and storage
devices) used by the application in performance of the critical
or vital business function. During emergency operations,
the business may be able to settle for far less processor
and storage capacity than it normally utilizes.
If critical and vital applications run on several homogeneous
or compatible processors in normal business processes, it
may be possible to replace several low-end servers with one
higher-end server (server consolidation). Through the use
of the right operating system software, even applications that
reside on heterogeneous processors may be able to run on one
processor. Again, the total capacity requirement of the backup
server and related storage devices may be substantially less than
that of the production environment.
The net result of this analysis is called the minimum acceptable
hardware configuration, which must be implemented quickly in
the event of a disaster.
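The idea of a minimum acceptable hardware configuration can be illustrated by summing the emergency-mode needs of only the critical applications. This is a simplified sketch: the `AppRequirement` fields and the example figures are assumptions, and a real analysis would also account for compatibility between platforms.

```python
from dataclasses import dataclass


@dataclass
class AppRequirement:
    name: str
    critical: bool             # supports a critical or vital function
    emergency_cpus: int        # processors needed in emergency mode (assumed)
    emergency_storage_gb: int  # storage needed in emergency mode (assumed)


def minimum_configuration(apps):
    """Total the emergency-mode needs of the critical applications."""
    needed = [app for app in apps if app.critical]
    return {
        "cpus": sum(app.emergency_cpus for app in needed),
        "storage_gb": sum(app.emergency_storage_gb for app in needed),
        "applications": [app.name for app in needed],
    }
```

Non-critical applications are left out of the total, which is why the resulting configuration can be much smaller than the production environment.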
Hot Sites
Hot sites are fully equipped IT operations facilities ready
to operate within a few hours. The alternate data processing
site contains the same set of hardware and software as the
primary site.
Cold Site
By using the cold site strategy, the organisation has already
prepared a facility with the requisite physical capabilities
to serve as an alternate data processing site.
The facility may be used for other purposes, including off-site
storage or new employee training, when not in use for disaster
recovery.
Reciprocal Backup Agreement
In this arrangement, two companies having spare processing time
and compatible hardware capabilities agree, formally or informally,
to back up each other's critical applications. For example,
in a simple arrangement, if Company A experienced a disaster,
Company B would allow Company A to restore its critical applications
on Company B's hardware. The reverse would be the case if
Company B had a failure.
Redundant System
In the event of a disaster, redundant systems at a separate
facility, which must be far enough away not to have
been affected by the same disaster, are brought online. Users
are either transported to an operations center that is co-located
with the backup site or are provided remote access to the backup
CPU via a pre-established data communications network.
Service Bureaus
An organisation may contract with a service bureau to
provide all alternate backup-processing services. The big
advantages of this type of arrangement are the quick response
and availability of the service bureau, the ability to test, and
the fact that the service bureau may be available for more than
backup alone.
Network Recovery
Network recovery plan formulation involves the development
of at least three discrete recovery strategies to cover:
- The internal enterprise network, defined as departmental or
  workgroup LANs interconnected via a switched or routed backbone
  network, as well as separate or converged telephony networks
- The "Local Loop" (Local Exchange Carrier services connecting
  the company facility to the LEC central office) and the WAN
- Network relocation, providing a means to rebuild mission-critical
  internal network services and to reroute WAN and telephony
  services to alternate end user and/or systems recovery sites
  in the wake of a disaster
To assist in formulating effective strategies, it may be useful
to define a loss scenario that will guide planning. In developing
an internal network recovery strategy, for example, DR
coordinators may wish to use a scenario
of media or equipment failure. This scenario-based approach
has the benefit of enabling flexible response to network interruptions
of different kinds. It also provides a basis for analyzing
and implementing preventive measures to protect against certain
types of outages.
Strategies for End User Recovery
End user recovery includes:
- The location and provisioning of backup end user work facilities
- The notification of employees who will staff the recovery site
- The transportation of employees to the recovery site
- The redirection of ground mail, telecommunications, and data
  networks to the recovery site
- The acquisition of supplies at the recovery site
- The application of remote access technologies for operating
  mission-critical applications from the user recovery site
- An emergency evacuation plan for personnel on the corporate
  premises in the event of a hazardous or life-threatening
  disaster event
Testing
Regular testing is required for effective DRP implementation.
The following things should be considered for effective DRP
testing:
- The DRP is tested to the fullest extent possible
- The associated costs are not prohibitive
- Service disruptions are minimal or non-existent
- The tests provide a high degree of assurance in recovery
  capability
- Evaluation of test results provides quality input to DRP
  maintenance
The Cycle Testing Paradigm:
Cycle testing consists of a series of exercises utilizing
multiple methodologies that often increase in complexity and
length from one phase to the next. The results of each test
are assessed individually; improvements and error corrections
are applied to the plan prior to beginning the next phase.
At the end of the cycle the entire plan has been completely
evaluated; in fact, many portions of the plan will have been
tested, assessed and updated multiple times. Small logistical
errors that could prove to be major obstacles in full scale
testing are isolated and removed from the plan. The iterative
framework of the test cycle provides continuous DRP evolution.
In the volatile world of Information Technology, hardware
and software upgrades, configuration changes and even business
process life cycling can occur quickly in response to market
demands and new service requirements. Cyclic recovery tests
provide an efficient pathway to DRP maintenance by early recognition
and correction of such problems. At the end of each exercise
and prior to the next, comprehensive debriefing, audit and
analysis are required in order to update the current test
plan as well as each of the following phases of the cycle.
[Figure: Illustration of a DRP Cycle Testing Scenario]
Checklist testing:
Checklists are the DRP consultant's most valuable tool. They
are inexpensive to implement and maintain and provide the
backbone of the testing cycle. The checklists are team oriented
and if used to their full potential provide multiple benefits.
For each business process, partition out areas of responsibility,
select teams appropriate to the specific nature of the partition
and allow the cumulative experience of the group to develop
the checklist as appropriate. The grassroots involvement heightens
recovery awareness and buy-in as the team members get a sense
that their input is an integral component of the process.
A checklist test can be used to validate multiple components
of the DRP, for example:
- Emergency Call Tree verification
- Key procedure validation
- Hardware and software configuration documentation complete
  and current
- Availability of process-specific resources during DRP
  implementation
- Tape backup libraries are complete and current with existing
  configuration
- Recovery plan and all necessary operational manuals
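Some checklist items lend themselves to an automated check. As one example, call tree verification could include confirming that every person who must be notified is actually reached by someone in the tree. The sketch below assumes the call tree is recorded as a mapping from each caller to the people they phone; the role names are hypothetical.

```python
def unreachable_staff(call_tree, root, staff):
    """Return the staff members nobody in the call tree would phone.

    `call_tree` maps each caller to the list of people they call,
    and `root` is the person who starts the notification.
    """
    reached = set()
    pending = [root]
    while pending:
        person = pending.pop()
        if person in reached:
            continue  # already notified; also guards against loops
        reached.add(person)
        pending.extend(call_tree.get(person, []))
    return sorted(set(staff) - reached)
```

An empty result means the tree covers everyone on the list; any names returned point to gaps to fix before the next phase of the test cycle.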
Walk Through Testing:
Team members verbally "walk through" the specific
steps as documented in the plan to confirm effectiveness and
to identify gaps, bottlenecks or other weaknesses in the plan.
Often used in conjunction with previously validated checklist
plans, this test provides the opportunity to review the plan
with a larger subset of people allowing you to draw upon a
correspondingly increased pool of knowledge and experiences.
Staff will be familiarized with procedures, equipment and
offsite facilities if required.
Simulation Testing:
Because the disaster is only simulated, normal operations are not
interrupted. Hardware, software, personnel, communications,
procedures, supplies and forms, documentation, transportation,
utilities, and alternate site processing should be thoroughly
tested in a simulation test. Extensive travel, moving equipment,
and eliminating voice or data communications may not be practical
or economically feasible during a simulated test. However,
validated checklists can provide a reasonable level of assurance
for many of these scenarios.
The simulation test should be considered advanced and only
implemented after the previous checklist and walk through
tests have been validated. The output of the previous tests
should be carefully analyzed before the proposed simulation
to ensure that the lessons learned during the previous phases
of the cycle have been applied.
Parallel testing:
A parallel test can be performed in conjunction with the checklist
test or simulation test. Under this scenario, historical transactions,
such as the prior business day's transactions, are processed
against the preceding day's backup files at the contingency processing
site or hot site. All reports produced at the alternate site
for the current business date should agree with those reports
produced at the existing (primary) processing site.
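The comparison step of a parallel test can be sketched as a simple reconciliation of report totals from the two sites. This assumes each site's reports have already been reduced to a mapping of report name to a comparable value (a total, a record count or a checksum); the report names in the usage note are hypothetical.

```python
def reconcile_reports(production, alternate):
    """Compare report values from the production and alternate sites.

    Returns a dict of report name -> (production value, alternate
    value) for every report that is missing on one side or whose
    values differ. An empty dict means the two sites agree.
    """
    mismatches = {}
    for name in sorted(set(production) | set(alternate)):
        prod_value = production.get(name)
        alt_value = alternate.get(name)
        if prod_value != alt_value:
            mismatches[name] = (prod_value, alt_value)
    return mismatches
```

If production reports 87 settlements and the alternate site reports 86, the function returns that one report with both values, giving the test team a concrete discrepancy to investigate.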
Full-interruption testing:
A full-interruption test activates the total disaster recovery
plan. The test is likely to be costly and could disrupt normal
operation, and therefore should be approached with caution.
Again, the importance of due diligence with respect to previous
phases of the cycle cannot be overstated.
It is important to note that the test cycle can consist of
one or more of the advanced testing methods. A test cycle
should consist of a minimum of three phases: a checklist test,
a walk-through test and at least one of the advanced testing
methods.
Training
Training should be provided at least annually; newly recruited
personnel who will be assigned planning responsibilities
should receive training shortly after they have joined. Contingency
planning personnel should be trained to execute their recovery
procedures without referring to the actual document.
Plan Maintenance
Disaster Recovery Plan may get obsolete if the organisation
may reorganize and the critical business units may be different
than when the plan was first created. Most commonly, changes
the location or configuration of hardware, software, and other
components.
Role of a consultant
A consultant brings specialized knowledge to the planning that
may facilitate the speedy development of an effective plan.
A consultant who works within a specific industry may combine
an understanding of the industry with a methodology for disaster
recovery planning. This shortens the learning curve, which in turn
can help to speed plan development. A consultant can also bring
a fresh eye to the project, noticing recovery requirements
that may be overlooked by someone who is too close to the
data center he or she is seeking to protect.
Conclusion
Disaster recovery planning involves more than off-site storage
or backup processing. Organizations should develop written,
comprehensive disaster recovery plans that address all the
critical operations and functions of the business. The plan
should include documented and tested procedures, which, if
followed, will ensure the ongoing availability of critical
resources and continuity of operations.