BCP vs DRP — What Each Covers
These two plans are closely related but cover different scopes. The Business Continuity Plan (BCP) is the broader document — it covers how the entire organisation continues operating during a disruption, including non-IT functions. Staff communication, temporary office locations, manual business processes when systems are unavailable, vendor relationships, and customer communication all fall under BCP. The BCP exists to answer: "If our building is unavailable or our systems are down, how do we keep the business running?"
The Disaster Recovery Plan (DRP) is specifically about IT systems restoration. It documents the technical steps to restore systems, data, and network connectivity after a disaster. DRP defines which systems are restored first (based on business priority), from which backup location, using which recovery procedures. The DRP is a subset of the BCP — it covers IT recovery specifically within the broader business continuity framework.
On the exam: a question about "how long the organisation can operate while IT systems are down" is about BCP. A question about "restoring the database server after a ransomware attack" is about DRP. A question about "the maximum time to restore email service after an outage" is about RTO within the DRP.
RTO and RPO — The Two Critical Metrics
| Metric | Full Name | Definition | Example |
|---|---|---|---|
| RTO | Recovery Time Objective | Maximum acceptable time systems can be down | "Email must be restored within 4 hours of failure" |
| RPO | Recovery Point Objective | Maximum acceptable data loss (measured in time) | "We cannot lose more than 1 hour of transaction data" |
| MTBF | Mean Time Between Failures | Average time a system operates before failing | "The SAN has an MTBF of 50,000 hours" |
| MTTR | Mean Time To Repair | Average time to restore system after failure | "Database server MTTR is 2 hours" |
RTO determines how fast recovery must happen. An RTO of 4 hours for email means that if email goes down at 9am, it must be restored by 1pm. This drives decisions about hot vs warm vs cold standby systems — a very short RTO requires systems that can take over immediately (hot standby), while a longer RTO allows time to provision replacement systems (cold standby).
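The 9am-to-1pm example above can be sketched in a few lines. This is only an illustration: the function names and the calendar date are assumptions, not part of any standard tool.

```python
from datetime import datetime, timedelta

def rto_deadline(failure_time: datetime, rto: timedelta) -> datetime:
    """Latest moment by which service must be back to satisfy the RTO."""
    return failure_time + rto

def meets_rto(failure_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """True if the service was restored on or before the RTO deadline."""
    return restored_time <= rto_deadline(failure_time, rto)

# Email fails at 9am with a 4-hour RTO (the date itself is arbitrary).
failure = datetime(2024, 1, 15, 9, 0)
email_rto = timedelta(hours=4)
deadline = rto_deadline(failure, email_rto)
print(deadline.strftime("%H:%M"))  # 13:00
```

A restore finishing at 12:30 would meet this RTO; one finishing at 13:01 would not.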
RPO determines how much data can be lost. An RPO of 1 hour means backups must run at least once an hour: if the system fails at 10am and the last backup was at 9am, that is 1 hour of potential data loss, which just meets the RPO. An RPO of zero means no data loss is acceptable, which requires synchronous replication, where every write is committed to a replica before the transaction is acknowledged.
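As a rough sketch of the RPO arithmetic, the data-loss window is simply the gap between the last good backup and the failure. The helper names and the example date below are illustrative assumptions.

```python
from datetime import datetime, timedelta

def data_loss_window(last_backup: datetime, failure_time: datetime) -> timedelta:
    """Amount of work (measured in time) that exists only on the failed system."""
    return failure_time - last_backup

def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the potential data loss does not exceed the RPO."""
    return data_loss_window(last_backup, failure_time) <= rpo

rpo = timedelta(hours=1)
last_backup = datetime(2024, 1, 15, 9, 0)    # arbitrary example date

# Failure at 10am: exactly 1 hour of exposure, which just meets the RPO.
print(meets_rpo(last_backup, datetime(2024, 1, 15, 10, 0), rpo))   # True
# Failure at 10:30am with no newer backup: 90 minutes of exposure.
print(meets_rpo(last_backup, datetime(2024, 1, 15, 10, 30), rpo))  # False
```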
The relationship between RTO/RPO and cost is inverse: shorter RTO and RPO require more expensive infrastructure (hot standby sites, synchronous replication, redundant systems) and operational investment. Business leadership defines acceptable RTO and RPO values by weighing the cost of recovery infrastructure against the cost of extended downtime or data loss — a core business continuity governance decision.
Backup Types — Full, Incremental, Differential
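A full backup copies everything; an incremental backup copies only what changed since the most recent backup of any type; a differential backup copies everything that changed since the last full backup. The exam tests this in calculation scenarios: "A company runs a full backup Sunday, incremental backups Monday through Saturday. The system fails Saturday evening. How many backup sets are needed for recovery?" Answer (assuming Saturday's backup completed before the failure): 7, the Sunday full plus all 6 daily incrementals. If they used differentials instead, only 2 are needed: the Sunday full plus Saturday's differential (which contains all changes since Sunday).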
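The counting rule behind this kind of question can be written down directly. The sketch below assumes one backup per day after the full backup; `sets_needed` is a hypothetical helper, not a feature of any backup product.

```python
def sets_needed(strategy: str, backups_since_full: int) -> int:
    """Number of backup sets required for a restore, counting the full backup.

    backups_since_full is how many incremental or differential backups
    have completed since the last full backup.
    """
    if backups_since_full < 0:
        raise ValueError("backups_since_full must be non-negative")
    if strategy == "full":
        return 1                                  # the latest full is enough
    if strategy == "incremental":
        return 1 + backups_since_full             # full + every incremental
    if strategy == "differential":
        return 1 + (1 if backups_since_full else 0)  # full + latest differential
    raise ValueError(f"unknown strategy: {strategy}")

# Full on Sunday, six daily backups Monday through Saturday.
print(sets_needed("incremental", 6))   # 7
print(sets_needed("differential", 6))  # 2
```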
Recovery Site Types
| Site Type | Infrastructure | Data Currency | RTO | Cost |
|---|---|---|---|---|
| Hot Site | Fully equipped, powered, and running — identical to primary | Real-time or near-real-time replication | Minutes to hours | Highest |
| Warm Site | Equipment present and powered but not fully configured — requires setup | Recent backups — hours to days old | Hours to days | Moderate |
| Cold Site | Empty space with power and connectivity — no equipment | Off-site backups must be shipped in | Days to weeks | Lowest |
| Cloud DR | On-demand provisioning — no permanent infrastructure | Configurable — near-real-time to hours | Hours (automation dependent) | Variable (pay-as-you-go) |
Hot site: A fully operational duplicate of the primary data centre, with all systems running and data replicated in real time. When a disaster occurs, traffic is redirected to the hot site and operations continue with minimal interruption. Hot sites are extremely expensive because they require maintaining a full duplicate infrastructure indefinitely, but they provide the shortest possible RTO.
Warm site: Equipment is present and powered but systems aren't fully configured or running production workloads. When a disaster occurs, administrators activate and configure the warm site systems, restore from the most recent backup, and bring services online. RTOs of several hours to a day are typical. Warm sites balance cost and recovery speed for organisations that can tolerate hours of downtime but not days.
Cold site: A facility with power, cooling, and network connectivity, but no equipment. After a disaster, the organisation ships hardware to the cold site, sets it up, restores from backup, and brings systems online. This can take days to weeks. Cold sites are appropriate for non-critical systems with long RTOs and tight budgets.
The 3-2-1 Backup Rule
The 3-2-1 backup rule is the industry-standard guideline for backup strategy: keep 3 copies of data (the production copy plus 2 backups), on 2 different types of media (disk and tape, or local and cloud), with 1 copy stored off-site. This protects against: single backup failure (3 copies), a storage type failure like NAS controller failure affecting all connected drives (2 media types), and site-level disasters like fire or flood (1 off-site copy).
Ransomware has updated this to the 3-2-1-1-0 rule: the additional "1" is an immutable or air-gapped copy that ransomware cannot reach (offline tape, immutable cloud storage with object lock), and the "0" means zero errors verified in backup restore testing. Untested backups are not reliable backups — organisations routinely discover their backup sets are corrupt or incomplete only when they need them after a disaster.
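One way to make the rule concrete is to check a backup inventory against it. The following is a minimal sketch: the `DataCopy` record, its fields, and the example inventory are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str                            # e.g. "disk", "tape", "cloud"
    offsite: bool = False
    immutable: bool = False               # immutable or air-gapped
    is_backup: bool = True                # False for the production copy
    restore_errors: Optional[int] = None  # None means never restore-tested

def meets_3_2_1(copies: List[DataCopy]) -> bool:
    """3 copies, on 2 media types, with at least 1 off-site."""
    return (len(copies) >= 3
            and len({c.media for c in copies}) >= 2
            and any(c.offsite for c in copies))

def meets_3_2_1_1_0(copies: List[DataCopy]) -> bool:
    """Adds: 1 immutable/air-gapped backup, 0 errors in restore tests."""
    backups = [c for c in copies if c.is_backup]
    return (meets_3_2_1(copies)
            and any(c.immutable for c in backups)
            and all(c.restore_errors == 0 for c in backups))

inventory = [
    DataCopy("production", "disk", is_backup=False),
    DataCopy("local-nas", "disk", restore_errors=0),
    DataCopy("cloud-object-lock", "cloud", offsite=True, immutable=True,
             restore_errors=0),
]
print(meets_3_2_1(inventory), meets_3_2_1_1_0(inventory))  # True True
```

In this sketch a backup that has never been restore-tested (`restore_errors=None`) fails the "0" condition, which mirrors the point that untested backups are not reliable backups.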
Cloud-Based Disaster Recovery
Cloud infrastructure has transformed disaster recovery by dramatically lowering the cost of recovery site capabilities. What previously required significant capital investment in physical hot site infrastructure can now be provisioned on-demand in the cloud, paying only for the infrastructure when it's actually needed.
Pilot light DR is a cloud DR approach where the minimum critical infrastructure (databases, core servers) is replicated to cloud and kept running at minimal scale, with the rest of the environment defined but not running. In a disaster, the environment is rapidly scaled up by launching additional compute instances, load balancers, and services from the pre-defined templates. This is cheaper than a full hot site because non-critical components aren't continuously running, but recovery time is longer than a full hot site because spinning up additional infrastructure takes time (typically 30–60 minutes).
Warm standby in cloud keeps a fully functional but reduced-capacity version of the production environment running in cloud, scaled down to minimum viable capacity. In a disaster, autoscaling is triggered to bring capacity up to full production scale. Recovery time is faster than pilot light because all components are already running — only scaling is required, not provisioning from scratch.
Cloud DR provides geographic flexibility — you can replicate to any of dozens of cloud regions worldwide, ensuring geographic diversity without owning physical infrastructure in multiple locations. It also enables regular DR testing at lower cost: spinning up a cloud DR environment for a 2-hour test once a quarter is operationally simple and costs only the hourly compute charges for the test period, versus the logistics of activating a physical DR site.
Order of Restoration
When recovering from a disaster that affects multiple systems simultaneously, order of restoration is the sequence in which systems are brought back online. Not all systems are equal — restoring them in the wrong order can prevent dependent systems from functioning or create security gaps. The BIA (Business Impact Analysis) drives the restoration order by identifying criticality rankings and interdependencies.
General principles for restoration order: infrastructure comes first. Core network and identity services (DNS, DHCP, Active Directory) must come up before application servers can function. A web server that boots before DNS is restored cannot be reached by name; an application server that starts before Active Directory is restored cannot authenticate users. Core infrastructure forms the foundation that everything else depends on. Next, restore critical business systems in order of their RTO requirements: payment processing before the internal wiki, customer-facing services before internal reporting tools. Finally, restore non-critical systems during the cleanup phase once operations have resumed.
The restoration order is documented in the DRP and tested during exercises. Common recovery mistakes include restoring application servers before their database backends are ready (resulting in application errors), restoring systems without verifying data integrity from backups (restoring corrupted backups extends the outage), and failing to verify that restored systems can communicate with each other before going live.
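Dependency-aware ordering of this kind is essentially a topological sort, with RTO used as a tiebreaker among systems that are ready at the same time. The sketch below uses Python's standard `graphlib` module; the system names, dependency map, and RTO values are hypothetical stand-ins for what a real BIA would supply.

```python
from graphlib import TopologicalSorter

# Each system maps to the systems it depends on (hypothetical inventory).
dependencies = {
    "dns": set(),
    "dhcp": set(),
    "active_directory": {"dns"},
    "database": {"dns", "active_directory"},
    "payment_app": {"database", "active_directory"},
    "internal_wiki": {"database", "active_directory"},
}

# RTO in hours per system (hypothetical values).
rto_hours = {
    "dns": 1, "dhcp": 1, "active_directory": 1,
    "database": 2, "payment_app": 2, "internal_wiki": 72,
}

def restoration_order(deps, rto):
    """Restore dependencies first; among ready systems, shortest RTO first."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()          # raises CycleError on circular dependencies
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda s: (rto[s], s))
        for system in ready:
            order.append(system)
            sorter.done(system)
    return order

order = restoration_order(dependencies, rto_hours)
print(order)
```

With these inputs the payment application is restored before the internal wiki, and neither is started until the database and Active Directory are back.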
Mean Time to Repair and Mean Time Between Failures
Two metrics appear on both A+ and Network+ exams in the context of hardware reliability and service availability planning: MTBF (Mean Time Between Failures) and MTTR (Mean Time to Repair).
MTBF is the average time a component or system operates without failure. A hard drive with an MTBF of 500,000 hours will not necessarily last that long; the figure is a statistical measure derived from failure rates across large populations of drives under specific conditions. MTBF is used to predict component reliability, justify redundancy (the lower the MTBF, the more important it is to have spares), and set maintenance schedules. A higher MTBF means a more reliable component. For availability calculations: Availability = MTBF / (MTBF + MTTR). A system with a 1,000-hour MTBF and a 10-hour MTTR has roughly 99% availability (1,000 / 1,010 ≈ 0.990).
MTTR is the average time required to restore a failed system to operational status, including diagnosis time, parts procurement, repair, and testing. MTTR drives the selection of recovery site type and backup strategies — an organization with an MTTR of 4 hours for their database server needs to ensure their RTO is greater than 4 hours, or they need to invest in reducing MTTR (spare parts on hand, hot standby systems, better documentation).
For exam scenarios: if a question asks which metric measures hardware reliability, the answer is MTBF. If it asks which metric measures how quickly a system can be restored, the answer is MTTR. If it asks for a formula combining them for availability, that's Availability = MTBF / (MTBF + MTTR).
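The availability formula is easy to check numerically. The sketch below is illustrative only; the function name is an assumption, and the 8,760-hour year ignores leap years.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)

a = availability(1000, 10)
print(f"{a:.2%}")                          # 99.01%

# Expected downtime over a year at that availability.
downtime_hours = (1 - a) * 8760
print(f"{downtime_hours:.1f} hours/year")  # 86.7 hours/year
```

Halving MTTR to 5 hours lifts availability to about 99.5%, which is why spare parts and good runbooks matter as much as reliable hardware.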
Testing and Maintaining BCP/DRP
A disaster recovery plan that has never been tested is a hypothesis, not a recovery capability. Testing is how organizations discover gaps (misconfigured backups that don't restore, runbooks with outdated server names, staff who don't know their roles) before a real disaster exposes them. The Security+ exam expects you to know the types of BCP/DRP test, listed below in ascending order of thoroughness and cost.
| Test Type | What Happens | Disruption | Exam Notes |
|---|---|---|---|
| Tabletop Exercise | Key stakeholders sit around a table and verbally walk through the plan in response to a simulated scenario. No systems are actually touched. | None | Best starting point for new plans. Identifies gaps in procedures, roles, and communications without risk. |
| Walkthrough / Structured Walk-Through | Each team member reviews their section of the plan in detail, checking contact lists, procedures, and resource requirements for accuracy. | None | Often combined with tabletop. Ensures plan documentation is current and accurate. |
| Simulation Test | A simulated disaster scenario is presented. Teams respond as they would in a real event but don't activate actual recovery systems. | Minimal | Tests communication and decision-making chains under realistic pressure. |
| Parallel Test | DR systems are activated and data is restored. Both primary and DR systems run simultaneously. Production is not switched over. | Low | Confirms DR systems actually work and data can be restored. Safest form of live DR testing. |
| Full Interruption Test | Production systems are deliberately brought down and full failover to DR systems is executed as it would be in a real disaster. | High | Most realistic but highest risk — production is offline. Rarely performed outside of critical infrastructure. |
High Availability Concepts
Business continuity isn't only about recovering from disasters — it's also about designing systems to avoid unplanned outages in the first place. High Availability (HA) refers to system designs that eliminate single points of failure, ensuring continuous operation even when individual components fail.
Redundancy is the foundation of HA — having duplicate components so that if one fails, another takes over. At the power level, redundant PSUs ensure a server survives a power supply failure. At the storage level, RAID arrays provide redundancy against disk failures. At the network level, redundant switches, uplinks, and ISP connections prevent network outages from single-device failures.
Clustering connects multiple servers so they operate as a single logical unit. In an active-active cluster, multiple nodes share the workload simultaneously. If one node fails, the remaining nodes absorb its traffic. In an active-passive cluster, one node handles all traffic while a standby node monitors and takes over if the primary fails. Active-active provides better resource utilization; active-passive provides simpler failover logic.
Geographic redundancy replicates critical systems across multiple physical locations. This protects against facility-level disasters — a fire, flood, or power outage at one site doesn't affect operations at the other. Cloud providers achieve this through availability zones (separate power, networking, and cooling within a region) and regions (geographically distinct data center clusters). The exam expects you to know that active-active multi-region deployments provide the highest availability but at the highest cost and complexity.
Data Replication and Recovery Concepts
Beyond backup and restore, enterprise continuity planning involves continuous data replication strategies that minimize RPO. Synchronous replication writes data to both the primary and DR storage simultaneously: a write is not acknowledged to the application until both locations have confirmed it. This provides an RPO of zero or near zero, but it adds latency to every write operation and requires a low-latency network link between sites. It is typically reserved for the most critical data (financial transactions, healthcare records).
Asynchronous replication writes to the primary immediately and replicates to the DR site shortly afterward — with a lag of seconds to minutes. This introduces a potential data loss window equal to the replication lag (which becomes the effective RPO). Asynchronous replication has lower impact on application performance and works over higher-latency connections between geographically distant sites, making it the practical choice for most business applications.
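The difference in data-loss exposure can be shown with a toy model. This is a deliberately simplified sketch: `ReplicatedStore` is a hypothetical in-memory class, not a real replication API, and it models lag as a queue of writes that have not yet been shipped to the replica.

```python
class ReplicatedStore:
    """Toy model of a primary with one replica at a DR site."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self._pending = []   # acknowledged writes not yet on the replica

    def write(self, record) -> bool:
        self.primary.append(record)
        if self.synchronous:
            # Replica is updated before the write is acknowledged.
            self.replica.append(record)
        else:
            # Acknowledge immediately; replicate later.
            self._pending.append(record)
        return True  # acknowledgement to the application

    def replicate_pending(self) -> None:
        """Ship queued writes to the replica (models the replication lag ending)."""
        self.replica.extend(self._pending)
        self._pending.clear()

    def lost_if_primary_fails_now(self) -> list:
        """Acknowledged writes that exist only on the primary."""
        return list(self._pending)

sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False)
for store in (sync_store, async_store):
    store.write("txn-1")
    store.write("txn-2")
async_store.replicate_pending()
for store in (sync_store, async_store):
    store.write("txn-3")

print(sync_store.lost_if_primary_fails_now())   # []
print(async_store.lost_if_primary_fails_now())  # ['txn-3']
```

The write still queued on the asynchronous store at the moment of failure is the replication lag expressed as lost data, which is the effective RPO.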
For exam scenarios: if the question specifies zero data loss is required, synchronous replication is the answer. If the question specifies a small RPO (say, 5 minutes) with a geographically distant DR site, asynchronous replication is more realistic. If the question asks about backup and recovery, remember that the backup type determines recovery complexity: a full backup alone is the simplest to restore but the largest to store; incrementals are the smallest and fastest to take but require the most sets for recovery; differentials grow over the week but need only two sets for recovery.