Every system fails. The question isn’t if — it’s how gracefully. Resilience engineering is the discipline of designing systems that continue to function when components inevitably break.
After two decades of building payment systems that process millions of transactions, I’ve learned one thing: the most expensive failures aren’t the spectacular crashes. They’re the slow degradations that nobody notices until it’s too late.
This article covers the practical patterns for building resilient cloud architectures on AWS and GCP.
The Four Pillars of Resilience
Before diving into cloud services, let’s establish the foundational patterns. Every resilient system is built on these four pillars:
1. Redundancy
Don’t rely on a single instance of anything. If one node fails, another takes over. If one region goes down, traffic routes to another.
2. Isolation (Bulkheads)
Failures should be contained. Named after the watertight compartments in ships, bulkheads ensure that a failure in one part of the system doesn’t cascade to others. If the recommendation engine crashes, the checkout must still work.
3. Graceful Degradation
Serve something, even if you can’t serve everything. When the product image CDN is down, show the page without images. When the recommendation engine times out, show popular items instead of personalised ones.
4. Fast Recovery
Detect failures quickly, recover automatically. Health checks, auto-scaling, and self-healing infrastructure reduce the mean time to recovery (MTTR) from hours to seconds.
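The graceful-degradation pillar is easiest to see in code. Below is a minimal sketch in Python: the function names and the static fallback list are illustrative, not from any real system.

```python
# Hypothetical static fallback used when personalisation is unavailable.
POPULAR_ITEMS = ["keyboard", "mouse", "monitor"]

def personalised_items(user_id: str) -> list[str]:
    """Stand-in for a call to the recommendation engine.

    Here it always times out, to simulate an outage.
    """
    raise TimeoutError("recommendation engine timed out")

def items_for(user_id: str) -> list[str]:
    """Graceful degradation: serve popular items when personalisation fails."""
    try:
        return personalised_items(user_id)
    except TimeoutError:
        return POPULAR_ITEMS  # serve something, even if not everything
```

The key design choice is that the fallback path is cheap and dependency-free: a cached or hard-coded list that cannot itself fail.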
Circuit Breakers: The Most Important Pattern
The circuit breaker pattern prevents a failing downstream service from taking down its callers. It works exactly like an electrical circuit breaker:
How It Works
Closed State (Normal): Requests flow through. The circuit breaker monitors failure rates.
Open State (Tripped): When failures exceed a threshold (e.g., 50% of requests fail in 30 seconds), the breaker trips. All subsequent requests are immediately rejected with a fallback response — the failing service doesn’t receive any traffic at all.
Half-Open State (Testing): After a cool-down period, the breaker allows a small number of test requests through. If they succeed, the breaker closes. If they fail, it opens again.
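The three states above can be sketched in a few dozen lines. This is a simplified illustration, not a production implementation (real libraries such as Resilience4j add sliding windows, metrics, and thread safety); the class and parameter names are my own.

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = "half-open"   # cool-down elapsed: let a test request through
            else:
                return fallback()          # reject immediately; downstream gets no traffic
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        self.state = "closed"              # a success in half-open closes the breaker
        return result
```

Note that while the breaker is open, callers pay only the cost of the fallback, which is what protects them from a slow, failing dependency.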
AWS Implementation
On AWS, you can implement circuit breakers at multiple levels:
- Application Level: Use libraries like Resilience4j (Java) or Polly (.NET) in your Fargate/ECS services.
- Infrastructure Level: AWS App Mesh (Envoy proxy) provides circuit breaking as a sidecar — no code changes required. Configure connection limits and outlier detection in your mesh policies.
- API Level: API Gateway can return cached responses when backends are unhealthy, acting as a crude but effective circuit breaker.
GCP Implementation
- Application Level: Same libraries (Resilience4j, Polly) in your Cloud Run services.
- Infrastructure Level: Traffic Director with Envoy provides circuit breaking for GKE-based services. Alternatively, Istio on GKE gives you the full service mesh with circuit breaker policies.
- Serverless Level: Cloud Run’s built-in request concurrency limits act as a natural bulkhead — a single instance can’t be overwhelmed.
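The bulkhead behaviour of a per-instance concurrency limit can be approximated in application code with a semaphore. A rough sketch, with hypothetical names, assuming a threaded request handler:

```python
import threading

class Bulkhead:
    """Caps concurrent calls into one dependency so it can't exhaust the caller."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, fallback):
        # Non-blocking acquire: if the compartment is full, shed load
        # immediately instead of queueing requests forever.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()
        finally:
            self._slots.release()
```

Rejecting excess work fast is the point: a full bulkhead degrades one feature, while an unbounded queue degrades the whole process.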
Multi-Region Resilience
For systems that must survive regional outages (like payment processing), you need multi-region architecture.
AWS: Active-Active with Route 53
| Component | Service | Role |
|---|---|---|
| DNS Routing | Route 53 | Health-checked failover between regions |
| Database | DynamoDB Global Tables | Multi-region, multi-active replication |
| Compute | ECS/Fargate in multiple regions | Independent service instances |
| Queue | SQS per region | Regional event processing |
| Storage | S3 Cross-Region Replication | Data durability |
Pattern: Route 53 performs health checks against your API in both regions. If ap-southeast-2 (Sydney) fails health checks, traffic automatically routes to us-east-1. DynamoDB Global Tables keep data in sync with sub-second replication.
GCP: Global by Default
GCP has an architectural advantage here — many services are natively global:
| Component | Service | Role |
|---|---|---|
| Load Balancing | Global HTTP(S) LB | Anycast IP, automatic regional routing |
| Database | Cloud Spanner | Globally distributed, strongly consistent |
| Compute | Cloud Run (multi-region) | Deploy to multiple regions simultaneously |
| Messaging | Pub/Sub | Natively global message delivery |
| Storage | Cloud Storage (multi-region) | Automatic geo-redundancy |
Pattern: A single Global Load Balancer with an Anycast IP routes users to the nearest healthy region. Cloud Spanner provides the holy grail: global distribution with strong consistency. No conflict resolution, no eventual consistency headaches.
Key Difference: AWS requires you to explicitly build multi-region. GCP’s primitives are often global by default. However, AWS gives you more control over exactly how failover behaves.
Chaos Engineering: Testing Resilience
You can’t know if your system is resilient unless you actively try to break it.
AWS Fault Injection Service (FIS)
AWS has a first-party chaos engineering tool:
- Inject failures into EC2, ECS, RDS, and networking.
- Experiment templates let you define: “Kill 30% of my containers in ap-southeast-2 and verify that response times stay under 500ms.”
- Safety controls automatically stop experiments if real customer impact is detected.
GCP: Open-Source Approach
GCP doesn’t have a first-party chaos tool, but the ecosystem is rich:
- Litmus Chaos (CNCF project) runs chaos experiments on GKE.
- Gremlin (SaaS) supports Cloud Run, GKE, and Compute Engine.
- Load testing with Locust or k6 can simulate degradation scenarios.
Where to Start
- Start in staging. Don’t run chaos experiments in production until your team is mature.
- Define steady state. What does “healthy” look like? Latency under 200ms? Error rate under 0.1%?
- Run the simplest experiment. Kill one container instance. Does traffic reroute?
- Observe and learn. The failures you discover in controlled experiments are infinitely cheaper than discovering them during a real outage.
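"Define steady state" deserves to be executable, since it is the pass/fail criterion for every experiment. A minimal checker, with budget numbers taken from the examples above (p95 latency under 200ms, error rate under 0.1%):

```python
def steady_state_ok(latencies_ms, errors, total,
                    p95_budget_ms=200.0, error_budget=0.001):
    """Check the steady-state hypothesis a chaos experiment must not violate."""
    ranked = sorted(latencies_ms)
    p95 = ranked[int(0.95 * (len(ranked) - 1))]  # nearest-rank 95th percentile
    error_rate = errors / total
    return p95 <= p95_budget_ms and error_rate <= error_budget
```

Run it against metrics collected before, during, and after the experiment; a violation during the experiment is the finding you were looking for.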
Self-Healing Patterns
The best resilient systems don’t just survive failures — they fix themselves.
Auto-Scaling as Self-Healing
| Scenario | AWS | GCP |
|---|---|---|
| Traffic spike | ECS Service Auto Scaling | Cloud Run automatic scaling |
| Instance failure | ASG replaces unhealthy instances | Managed Instance Group auto-repair |
| Database overload | Aurora read replica auto scaling | Cloud SQL read replicas + automatic storage increase |
Health Check Cascades
Design health checks that test real functionality, not just “is the process running”:
- Shallow: HTTP 200 from /health → The process is alive
- Deep: Query database, check cache, verify queue connection → The service can actually do its job
- Dependency: Check all downstream services → The system is fully operational
Use shallow checks for load balancers (fast, frequent) and deep checks for deployment gates (thorough, less frequent).
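A deep check is just a set of dependency probes rolled up into one status. A framework-agnostic sketch (the probe callables and status codes are illustrative; wire this behind your web framework's health route):

```python
def deep_health(checks: dict) -> tuple[int, dict]:
    """Run each dependency probe; any failure makes the whole check unhealthy.

    `checks` maps a dependency name to a zero-arg callable that raises on failure.
    Returns an HTTP-style status code plus a per-dependency report.
    """
    report = {}
    for name, probe in checks.items():
        try:
            probe()
            report[name] = "ok"
        except Exception as exc:
            report[name] = f"fail: {exc}"
    status = 200 if all(v == "ok" for v in report.values()) else 503
    return status, report
```

Returning the per-dependency report alongside the status code makes the deployment gate debuggable: you see which dependency failed, not just that something did.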
Practical Recommendations
| Your Situation | Start Here |
|---|---|
| Single-region, tolerating downtime | Add health checks + auto-scaling |
| Need high availability (99.9%) | Implement circuit breakers + graceful degradation |
| Need extreme availability (99.99%) | Multi-region active-active + chaos engineering |
| Regulated industry (finance, health) | All of the above + audit trails + automated failover testing |
The Non-Negotiables
Regardless of your cloud or scale, every production system should have:
- Health checks on every service
- Circuit breakers on every external call
- Timeouts on every network request (never wait forever)
- Retries with exponential backoff (don’t hammer a failing service)
- Dead-letter queues for failed async processing
- Alerting on error rate changes, not just thresholds
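Of these, retries with exponential backoff are the most commonly implemented by hand, and the most commonly implemented wrong. A minimal sketch using capped exponential backoff with full jitter (the function name and defaults are my own):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call fn, retrying on any exception with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the backoff window,
            # so a fleet of retrying clients doesn't hammer in lockstep.
            time.sleep(random.uniform(0, delay))
```

The jitter is not optional decoration: without it, every client that failed at the same moment retries at the same moment, turning a brief blip into a synchronized thundering herd.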
Conclusion
Resilience isn’t a feature you add at the end. It’s a design philosophy that influences every architectural decision from day one. The cloud platforms give you the primitives — health checks, auto-scaling, multi-region replication, chaos testing — but you have to choose to use them.
Start with the basics: health checks, timeouts, and circuit breakers. Then graduate to multi-region and chaos engineering as your system matures. The cost of resilience is always cheaper than the cost of a production outage.
Written by Haris Habib from Sydney, Australia | February 2026