APRA CPS 230: The 90-Day Engineering Framework

Most CPS 230 programs can survive a steering committee. Far fewer can survive a real outage.
CPS 230 took effect on 1 July 2025.
For pre-existing service-provider contracts, the catch-up deadline is the earlier of renewal or 1 July 2026.
The real question now is no longer “Do we have a framework?” It is “Can engineering prove it under pressure?”

That is the gap many institutions are still carrying.

Most CPS 230 commentary is written for Boards, CROs, resilience teams, and operational risk functions. That makes sense. The standard is governance-heavy by design.

But governance is only half the story.

The quality of a CPS 230 program is now limited by the quality of the engineering evidence underneath it. Risk teams can often describe the framework well. Engineering teams still cannot always show how a critical operation maps to systems, what recovery target really applies, which incidents trigger APRA notification, or what evidence a Board can safely rely on.

That is where the 5-star version of CPS 230 starts to crack.

APRA’s CPS 230 Operational Risk Management requires entities to identify critical operations, set tolerance levels, maintain credible business continuity capability, manage material service providers, and test under severe-but-plausible conditions. APRA’s CPG 230 guidance makes the technical translation even clearer: the maximum period of disruption maps naturally to a recovery time objective (RTO), and the maximum extent of data loss maps naturally to a recovery point objective (RPO).

That means CPS 230 is not only a governance uplift.

It is an engineering operating model.

[Figure: CPS 230's Technical Gap]

The 5-star CPS 230 trap

A 5-star CPS 230 program is one where nothing obviously bad has happened yet.

There is a policy. There is a committee. There is a Board pack. People feel moderately reassured.

But if a director, regulator, or crisis lead asks four simple questions, the confidence starts to wobble:

What is the actual RTO for this critical operation?
What is the actual RPO?
Which incidents trigger the 72-hour APRA clock?
Which failures push the business outside tolerance and into the 24-hour clock?

That is why the dangerous CPS 230 program is the one that looks finished in PowerPoint.

A 10-star CPS 230 program feels different. It feels like the hard conversation has already been rehearsed. Risk can ask the question. Engineering can show the map. Operations can explain the fallback. The Board can see the evidence.

That is the standard worth aiming for in a regulated environment.

The engineering translation that actually matters

The most common CPS 230 failure pattern is that tolerance levels stay in business language for too long.

They need to become technical commitments.

Here is the translation that matters:

CPS 230 concept	Engineering translation	Evidence that should exist
Critical operation	Service and dependency map	Application inventory, runbooks, owner map
Maximum period of disruption	RTO	Failover and restore test results
Maximum extent of data loss	RPO	Backup, replication, and replay evidence
Minimum service levels	Degraded-service target	Manual fallback and partial-service runbooks
Material service provider	Dependency and concentration view	Contract controls, access rights, exit and contingency evidence
Operational risk incident	Regulatory decision tree	72-hour and 24-hour escalation workflow
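
One way to keep this translation honest is to hold each critical operation as a record and check which evidence is still missing. The sketch below is illustrative only: the `CriticalOperation` fields, the evidence labels in `REQUIRED_EVIDENCE`, and the sample values are hypothetical names chosen to mirror the table, not terms defined by APRA.

```python
from dataclasses import dataclass, field
from datetime import timedelta

# Hypothetical evidence labels, one per row of the translation table.
REQUIRED_EVIDENCE = frozenset({
    "service_map",
    "restore_test",
    "backup_replication",
    "fallback_runbook",
    "provider_contingency",
    "escalation_workflow",
})


@dataclass
class CriticalOperation:
    name: str
    owner: str
    max_disruption: timedelta   # Board-approved tolerance, read as the RTO
    max_data_loss: timedelta    # Board-approved tolerance, read as the RPO
    min_service_level: str      # degraded-service target in plain words
    evidence: set = field(default_factory=set)


def missing_evidence(op: CriticalOperation) -> list:
    """Return the evidence types that do not yet exist for this operation."""
    return sorted(REQUIRED_EVIDENCE - op.evidence)


# Sample values are invented for illustration.
payments = CriticalOperation(
    name="payments",
    owner="Head of Payments Engineering",
    max_disruption=timedelta(hours=2),
    max_data_loss=timedelta(minutes=5),
    min_service_level="manual processing of high-value payments",
    evidence={"service_map", "restore_test"},
)
print(missing_evidence(payments))
```

The point of the record is less the code than the discipline: a tolerance with an empty evidence set is visibly an aspiration.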

This is where many institutions still have a credibility gap.

The Board may have approved a tolerance level. But if engineering cannot show the service map, the recovery path, the restore timing, the manual fallback, and the supplier dependency chain underneath that statement, then the tolerance is still more aspiration than operating fact.

That is why the right question for risk officers is no longer “Do we have a CPS 230 framework?”

It is “Can engineering prove we stay within tolerance, or recover predictably when we do not?”

What 10-star looks like in practice

The move from 5 stars to 10 is not about writing more. It is about making resilience legible under pressure.

Area	5-star posture	10-star posture
Critical operations	Listed in policy	Mapped to real systems, people, suppliers, and fallback paths
Tolerance levels	Approved in principle	Translated into RTO, RPO, and degraded-service targets
Incident handling	Severity labels only	Severity plus regulatory trigger logic
Scenario testing	DR exercise theatre	Severe-but-plausible failure rehearsal
Board oversight	Narrative reassurance	Evidence pack with gaps, proof, and remediation status

In a 10-star CPS 230 program:

risk and engineering share one vocabulary
incident responders know when they are inside or outside tolerance
service-provider dependency is visible before a crisis, not during one
Board confidence is earned through current evidence, not presentation quality

That is a much higher bar. It is also a much safer one.

The pathway in one view

If you want one visual that explains the whole article, it is this:

Identify the critical operation.
Translate tolerance into RTO, RPO, and degraded-service expectations.
Decide whether an incident is material, outside tolerance, or both.
Test the ugly scenarios, including supplier failure.
Turn the results into evidence a Board can challenge and rely on.

That is the pathway from policy comfort to operational proof.

[Figure: The CPS 230 Pathway from Policy to Proof]

The 90-day engineering framework

This is the fastest useful way to close the gap without turning the program into another documentation factory.

Days 1-30: Build the critical operations recovery map

The first month is about forcing specificity.

For each critical operation, engineering, architecture, operations, and risk should produce one working map that shows:

the applications and platforms involved
the key databases and data stores
upstream and downstream integrations
infrastructure and network dependencies
identity and access dependencies
manual fallback steps
internal support teams
third-party and fourth-party dependencies

This is where hidden weaknesses appear. Many institutions have a business continuity plan. Far fewer have a trustworthy recovery map showing what actually has to recover, in what order, and with what minimum viable capability.
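
The "in what order" part of a recovery map can be derived rather than guessed once dependencies are written down. A minimal sketch, assuming a hypothetical dependency map (`DEPENDS_ON` and its component names are invented) and using the standard-library `graphlib` module available from Python 3.9:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each component lists what must be up before it.
DEPENDS_ON = {
    "payments-api": {"payments-db", "identity-provider"},
    "payments-db": {"storage"},
    "identity-provider": {"network"},
    "storage": {"network"},
    "network": set(),
}


def recovery_order(depends_on: dict) -> list:
    """Order components so every dependency recovers before its dependants.

    Raises graphlib.CycleError if the map contains a circular dependency,
    which is itself a finding worth recording.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = recovery_order(DEPENDS_ON)
print(order)
```

Components with no ordering constraint between them may appear in either order, so the output is one valid sequence rather than the only one.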

At the same time, translate tolerance levels into technical targets.

If a critical operation can only be unavailable for two hours, engineering should be able to point to the architecture, staffing model, failover pattern, and runbook discipline that make a two-hour RTO credible. If only minimal data loss is acceptable, the backup and replication design should support that RPO in practice, not only in a policy statement.
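
The two-hour example can be turned into a mechanical check: compare the most recent tested recovery result with the approved tolerance and list every shortfall. This is a sketch under stated assumptions; `RecoveryTarget`, `find_gaps`, and the sample figures are hypothetical, and a missing test result is treated as a gap in its own right.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class RecoveryTarget:
    operation: str
    approved_rto: timedelta
    approved_rpo: timedelta
    tested_rto: Optional[timedelta] = None  # None means no test evidence
    tested_rpo: Optional[timedelta] = None


def find_gaps(targets: list) -> list:
    """Return (operation, target, reason) for every unsupported tolerance."""
    gaps = []
    for t in targets:
        checks = (
            ("RTO", t.approved_rto, t.tested_rto),
            ("RPO", t.approved_rpo, t.tested_rpo),
        )
        for label, approved, tested in checks:
            if tested is None:
                gaps.append((t.operation, label, "no test evidence"))
            elif tested > approved:
                gaps.append((t.operation, label, "tested result exceeds tolerance"))
    return gaps


# Sample figures are invented for illustration.
targets = [
    RecoveryTarget("payments", timedelta(hours=2), timedelta(minutes=5),
                   tested_rto=timedelta(hours=3), tested_rpo=timedelta(minutes=1)),
    RecoveryTarget("lending", timedelta(hours=8), timedelta(hours=1)),
]
for gap in find_gaps(targets):
    print(gap)
```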

By day 30, you want:

a critical operations register connected to real systems and suppliers
an RTO and RPO view for every critical operation
defined degraded-service expectations
a gap list where current capability does not support approved tolerance

If the gap list is uncomfortable, good. That means you finally have the real problem on the table.

Days 31-60: Rebuild incident classification around regulatory triggers

This is where CPS 230 becomes operationally sharp.

Most engineering teams already classify incidents by severity. But those taxonomies are usually built for restoration speed, not prudential reporting.

CPS 230 adds a second lens: regulatory materiality.

Under the standard, an entity must notify APRA as soon as possible, and no later than 72 hours, after becoming aware of an operational risk incident that is likely to have a material financial impact or a material impact on its ability to maintain critical operations. It must notify APRA as soon as possible, and no later than 24 hours, if it suffers a disruption to a critical operation outside tolerance.

That means responders need more than “Sev 1” and “major incident” labels.

They need a decision path that answers:

Is a critical operation affected?
Is the operation outside tolerance?
Is the likely impact material financially or operationally?
Is this a near miss or control failure that should change our risk view even if APRA notice is not triggered?
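
The four questions above can be sketched as a small decision function that sits alongside the existing severity label. Everything here is illustrative: `IncidentAssessment` and `assess` are hypothetical names, the materiality flags stand in for judgments that named owners would make, and measuring the 24-hour clock from the same awareness timestamp as the 72-hour clock is an assumption that should be confirmed against the standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentAssessment:
    aware_at: datetime                   # when the entity became aware
    critical_operation_affected: bool
    outside_tolerance: bool
    material_financial_impact: bool
    material_operational_impact: bool
    near_miss_or_control_failure: bool = False


def assess(a: IncidentAssessment) -> list:
    """Return (action, deadline) pairs; deadline is None for internal actions."""
    actions = []
    if a.critical_operation_affected and a.outside_tolerance:
        # Assumption: the 24-hour clock is measured from aware_at.
        actions.append(("notify APRA: outside tolerance",
                        a.aware_at + timedelta(hours=24)))
    if a.material_financial_impact or a.material_operational_impact:
        actions.append(("notify APRA: material incident",
                        a.aware_at + timedelta(hours=72)))
    if a.near_miss_or_control_failure:
        actions.append(("update operational risk profile", None))
    return actions


incident = IncidentAssessment(
    aware_at=datetime(2025, 8, 1, 9, 0, tzinfo=timezone.utc),
    critical_operation_affected=True,
    outside_tolerance=True,
    material_financial_impact=False,
    material_operational_impact=True,
)
for action, deadline in assess(incident):
    print(action, deadline)
```

Returning every applicable action, rather than only the first match, reflects that an incident can be both material and outside tolerance at once.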

[Figure: High-Level CPS 230 Incident Escalation Sequence]

This is also the right window to test the severe-but-plausible scenarios that CPS 230 expects. Basic disaster recovery theatre is not enough.

The better test set includes:

cloud control-plane outage
identity provider failure
corrupted restore
managed service provider loss
cyber containment that forces manual operations
a dependency failure that leaves the platform “up” while the critical operation is unusable

By day 60, you want:

incident severity mapped to CPS 230 notification logic
named owners for 72-hour and 24-hour assessments
evidence capture built into incident response
scenario results with remediation owners and deadlines
near misses feeding the operational risk profile instead of disappearing into folklore

If your incident process cannot separate “major outage” from “outside tolerance”, your program is still too theoretical.

Days 61-90: Build the Board evidence pack

This is where the program either becomes defensible or starts to wobble.

CPS 230 does not create a standalone engineering attestation. But it absolutely creates a Board evidence problem.

Boards are expected to approve tolerance levels, oversee resilience, review failures to remain within tolerance, and understand the risk created by material service providers. In parallel, broader annual declarations under CPS 220 Risk Management still depend on management being able to show the framework is working in reality.

That means engineering evidence is no longer optional support material. It is part of what makes Board reliance reasonable.

[Figure: The CPS 230 Evidence Loop]

By day 90, the evidence pack should include:

the critical operations register and supporting service maps
approved tolerance levels translated into RTO, RPO, and degraded-service targets
current recovery capability versus required tolerance
severe-but-plausible scenario results
incidents and near misses relevant to critical operations
remediation status for missed targets and failed controls
material service-provider dependencies, concentration risks, and contingency options
backup, restore, failover, and manual-operation evidence

This is the level where Board challenge gets better.

Instead of asking whether the institution “has” a resilience framework, directors can ask:

Which critical operations are closest to breaching tolerance?
Which suppliers create the hardest recovery dependency?
Where is RTO or RPO least credible against approved tolerance?
Which repeat incidents suggest control weakness, not bad luck?
What evidence supports management confidence today, not six months ago?
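
The last question, evidence from today rather than six months ago, is easy to automate once each piece of evidence carries a date. A minimal sketch, assuming a hypothetical 180-day freshness threshold (`MAX_EVIDENCE_AGE`) and invented evidence names and dates; the right threshold is a policy decision, not something the standard prescribes here.

```python
from datetime import date, timedelta

# Assumption: evidence older than roughly six months is treated as stale.
MAX_EVIDENCE_AGE = timedelta(days=180)


def stale_evidence(last_tested: dict, as_of: date,
                   max_age: timedelta = MAX_EVIDENCE_AGE) -> list:
    """Return the names of evidence items last tested more than max_age ago."""
    return sorted(
        name for name, tested_on in last_tested.items()
        if as_of - tested_on > max_age
    )


# Invented evidence items and dates for illustration.
last_tested = {
    "payments failover test": date(2025, 3, 1),
    "payments restore test": date(2025, 9, 15),
    "identity provider failover": date(2024, 11, 20),
}
print(stale_evidence(last_tested, as_of=date(2025, 10, 1)))
```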

That is a much healthier conversation.

The 10-star test for CPS 230

If you want a simple quality gate, use this:

In under 10 seconds, a risk leader should be able to see the critical operation, the owner, the tolerance, and the current concern.
In under 10 minutes, engineering should be able to show the dependency map, recovery path, and most recent test evidence.
In under an hour, the institution should know whether it is dealing with a major incident, a 72-hour APRA notification scenario, or a 24-hour outside-tolerance event.
By the next Board or executive review, the evidence pack should show what failed, what changed, who owns the gap, and when it will be retested.

That is what a confidence-building CPS 230 operating model looks like.

The tactical checklist engineering teams actually need

If you want the engineering version in one view, start here:

Map every critical operation to applications, data stores, integrations, infrastructure, people, and third parties.
Translate each tolerance level into RTO, RPO, and degraded-service expectations.
Identify single points of failure across identity, data, networking, and service providers.
Add incident logic for material impact, outside tolerance, and APRA escalation.
Capture near misses and repeat control failures, not only customer-visible incidents.
Test cloud, identity, restore, cyber, and provider-loss scenarios that would genuinely hurt.
Prove backup, restore, and failover timing against the approved target, not the vendor brochure.
Review material service-provider contracts for service levels, APRA access, exit rights, subcontracting, and contingency value.
Track remediation to closure with named owners, dates, and retest evidence.
Package the outputs in a Board-ready format that can support oversight and annual declarations.
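
For the single-point-of-failure and service-provider items in this checklist, a useful first pass is to invert the critical operations map and see which dependencies several critical operations share. The sketch below uses hypothetical names (`shared_dependencies`, the operations, and the providers are invented). A shared dependency signals concentration, not automatically a single point of failure, since it may have redundancy behind it.

```python
from collections import defaultdict


def shared_dependencies(operations: dict, threshold: int = 2) -> dict:
    """Map each dependency used by at least `threshold` operations to those operations."""
    used_by = defaultdict(set)
    for operation, dependencies in operations.items():
        for dependency in dependencies:
            used_by[dependency].add(operation)
    return {
        dependency: sorted(ops)
        for dependency, ops in used_by.items()
        if len(ops) >= threshold
    }


# Invented operations and providers for illustration.
operations = {
    "payments": {"idp-a", "cloud-x", "core-db"},
    "lending": {"idp-a", "cloud-x"},
    "claims": {"idp-a", "msp-y"},
}
print(shared_dependencies(operations, threshold=3))
```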

What the 30 April 2026 amendments do and do not change

APRA’s targeted amendments of 30 April 2026 matter, but less than some teams hope.

The change created a narrow exemption where strict contractual terms may be impracticable for certain non-traditional service providers. That is useful at the margin.

It does not remove the core engineering burden.

You still need to know which providers matter, how concentration risk behaves, how manual or alternative arrangements work, and what happens when a provider becomes unavailable at exactly the wrong time.

In other words, the amendment may soften one contracting edge case. It does not soften the resilience test.

The real point of CPS 230

The point of CPS 230 is not better language.

It is better resilience.

Not a calmer policy library. Not a prettier dashboard. Not a stronger committee rhythm.

Actual resilience.

The firms that do well here will not be the ones with the prettiest policy suite.

They will be the ones that can show, with evidence, how engineering operationalises resilience inside approved tolerance.

If you are a risk officer, CIO, engineering leader, or operational resilience owner, where is your program still 5-star today: recovery mapping, incident classification, supplier dependency, or Board proof?