What This Article Is About
You wrote new code. Tests pass on your laptop. Now you have to put it on production servers where real users live. The moment between "old code is running" and "new code is running" is where most production incidents happen.
A deployment strategy is a recipe for making that transition safer. Some strategies are simple but cause downtime. Some have zero downtime but cost twice the infrastructure. Some let you catch bugs on 1% of users instead of 100%. None of them are free.
This guide walks through every major strategy, the trade-offs, when to use each, and the parts everyone gets wrong (database migrations, health checks, rollback automation).
The Core Goal
Three things, in order of importance:
1. Don't break production. If the new code has bugs, fewer users should hit them.
2. Don't take it down. Users should not see error pages while you deploy.
3. Make rollback fast. When (not if) something goes wrong, you want to be back on the old version in seconds, not hours.
Every strategy below is a different attempt to balance these three. The "best" strategy depends on your traffic, your team, your tooling, and how scared you are of the change.
The Mental Model: Two Things Are Changing
When you deploy, two things change at the same time:
The code running on your servers.
The traffic hitting that code.
Every deployment strategy is a different way of separating these two. You can change the code first and then send traffic to it later (blue/green). You can change them gradually together (rolling). You can change the code without sending any real traffic and just test it (shadow). Once you see deployment as "code change" plus "traffic shift", the strategies become easy to compare.
Strategy 1: Recreate
Stop the old version. Start the new version. The simplest possible deployment.
When it makes sense: dev environments, staging, internal tools, scheduled maintenance windows. Anywhere a few minutes of downtime is acceptable.
When it doesn't: any system real users depend on. Even small B2C apps can lose money during downtime, and "just announce a maintenance window" stops scaling once you have users in different time zones.
The dirty secret: a lot of small companies still do this in production by accident. They have one server, they SSH in, they pull the new code, they restart. That is recreate. It works until it doesn't.
Strategy 2: Rolling Deployment
You have many instances of your app (10 pods, 20 VMs, etc.). Replace them gradually, a few at a time, until all are running the new version. The service stays up the whole time because most instances are still serving traffic.
How it works in practice: Kubernetes Deployments do this by default. You configure maxSurge (how many extra pods can run during rollout) and maxUnavailable (how many can be missing). Typical values: surge 25%, unavailable 0%. So you always have full capacity, just temporarily a bit more.
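A minimal sketch of that arithmetic, purely as an illustration (not Kubernetes source code): Kubernetes rounds the maxSurge percentage up and the maxUnavailable percentage down when converting to pod counts.

```typescript
// Illustration only: how maxSurge / maxUnavailable bound a rolling update.
// Kubernetes rounds maxSurge percentages up and maxUnavailable down.
function rolloutBounds(replicas: number, maxSurgePct: number, maxUnavailablePct: number) {
  const surge = Math.ceil((replicas * maxSurgePct) / 100);             // extra pods allowed
  const unavailable = Math.floor((replicas * maxUnavailablePct) / 100); // pods allowed missing
  return {
    maxPods: replicas + surge,            // never more than this many pods total
    minReadyPods: replicas - unavailable, // never fewer than this many serving traffic
  };
}

// 10 replicas, surge 25%, unavailable 0% -> at most 13 pods, never below 10 ready.
console.log(rolloutBounds(10, 25, 0)); // { maxPods: 13, minReadyPods: 10 }
```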
Pros: no downtime. No extra infrastructure (you might temporarily have a few extra pods, but not a full second copy of everything). Built into most orchestrators. Default behavior in Kubernetes.
Cons:
1. Two versions running at the same time. During the rollout, some users hit v1 and some hit v2. Your code, APIs, message formats, and database schema all need to be backwards-compatible during this window. This sounds easy; in practice it is where most deploy-time bugs come from.
2. Slow rollback. If you find a bug at 80% rolled out, you have to do another rolling deployment back to v1, which takes minutes. Compare to blue/green, which rolls back in seconds.
3. No real validation. Once the rollout starts, you don't get a "pause and confirm" gate by default. The pod was ready (passed health check), so traffic flows.
When to use: default for most production systems. Combine with feature flags for important changes.
Strategy 3: Blue/Green
Run two complete copies of production. Call them blue and green. At any moment, exactly one is "live" (receiving real user traffic). Deploy v2 to the idle environment. Test it. When you're happy, flip the load balancer to point at the new environment. The old one is now idle but still running, so rollback is just flipping back.
How the switch works: usually a load balancer config change. Update the target group from blue to green. AWS ALBs do this in seconds. DNS-based switches are slower because of TTLs but conceptually the same.
Pros:
Instant rollback. Something broke? Flip the LB back. Less than a second.
Pre-cutover testing. You can run smoke tests on green with synthetic traffic before flipping real users.
Clean transition. No window where two versions serve real traffic. Either everyone is on v1 or everyone is on v2.
Cons:
2x infrastructure cost during deploys. If your prod is 100 servers, you need another 100 standing by for green. Either you accept the cost or you only spin up green when needed (and then you wait for it to warm up).
Database is shared. Both blue and green write to the same DB. Schema changes are still risky.
Long-lived connections. WebSockets, SSE, gRPC streams. They were established to blue. After the flip, you have to drain them gracefully.
Stateful services. If your app holds in-memory state (sessions, caches, queues), the flip might lose it.
When to use: when the absolute fastest possible rollback is worth more than the infrastructure cost. Banking, payments, anything where a 5-minute incident is a board-level event.
Strategy 4: Canary
Send a small percentage of real traffic to the new version. Watch the metrics. If they look fine, increase the percentage. If they look bad, route everyone back to the old version. The new version is a "canary in the coal mine".
Typical progression: 1% → 5% → 25% → 50% → 100%, with a hold at each stage to let metrics stabilize. Could be minutes (high-volume service) or hours (lower volume; you need enough samples).
What you watch:
Error rate (5xx, exceptions, panics).
Latency (p50, p95, p99).
Throughput (RPS).
Saturation (CPU, memory, DB connections).
Business metrics (checkout success rate, login rate, etc.).
The trick: you compare these between v1 and v2 in real time. If v2 is statistically worse, abort. Tools like Argo Rollouts, Flagger, Spinnaker do this analysis automatically.
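A minimal sketch of the kind of comparison those tools automate, assuming you can already query per-version metrics somewhere; the thresholds and the VersionStats shape are placeholders, and real tools apply proper statistics rather than a raw ratio.

```typescript
// Sketch of an automated canary check: compare v2 against v1 and decide
// whether to continue, using simple ratio thresholds.
interface VersionStats {
  errorRate: number;    // errors / total requests
  p99LatencyMs: number;
}

function canaryVerdict(v1: VersionStats, v2: VersionStats): "promote" | "abort" {
  const maxErrorRatio = 1.5;   // v2 may be at most 1.5x worse than v1
  const maxLatencyRatio = 1.2; // and at most 20% slower at p99

  if (v2.errorRate > v1.errorRate * maxErrorRatio) return "abort";
  if (v2.p99LatencyMs > v1.p99LatencyMs * maxLatencyRatio) return "abort";
  return "promote";
}

// Example: v2 errors twice as often as v1 -> abort and route traffic back.
console.log(canaryVerdict(
  { errorRate: 0.01, p99LatencyMs: 200 },
  { errorRate: 0.02, p99LatencyMs: 210 },
)); // "abort"
```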
Pros:
Real production validation.
Limited blast radius (most users still on v1 if v2 breaks).
Surfaces bugs that only happen at scale or with weird real-world inputs.
Can be fully automated end to end.
Cons:
Requires sophisticated traffic routing (service mesh, smart load balancer, ingress controller with weighting).
Requires solid metrics infrastructure. If you can't tell whether v2 is worse than v1, canary is theatre.
More moving parts in the deploy pipeline.
Two versions running side-by-side, with all the schema/API compatibility headaches that brings.
When to use: when you ship often, the cost of a bad release is high, and you have the metrics + tooling to back it up. The gold standard for high-stakes services at companies like Netflix, Google, and Facebook.
Canary vs Rolling: They Look Similar But Aren't
People confuse these. Both have v1 and v2 running together, both gradually shift traffic. The differences:
Rolling shifts whole pods. Once a v2 pod is up, it gets a normal share of traffic (typically 1/N of total). You can't run "1% canary" because that ratio depends on pod count.
Canary shifts traffic by weight, independent of pod count. You can have 50 v1 pods and 1 v2 pod and route 1% of traffic to it, or you can have 1 v1 pod and 1 v2 pod and route 1% of traffic to v2. The router (service mesh, weighted ingress) makes that possible.
Rolling is "replace the fleet gradually". Canary is "test on a small slice and decide whether to continue".
Strategy 5: Feature Flags
Deploy code in disabled state. Enable it for a percentage of users via a flag. The flag check happens at runtime, so deploy and release are decoupled.
Code looks like this:
```
if (featureFlags.isEnabled("new_checkout_flow", user)) {
  return newCheckoutFlow(user);
} else {
  return oldCheckoutFlow(user);
}
```
You ship the feature to production with the flag off. Later, you turn it on for 1% of users, then 10%, etc. If anything breaks, you turn it off, no redeploy needed.
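How a flag platform decides which users fall into that 1% is typically deterministic hashing, so the same user stays in (or out) across requests and sessions. A minimal sketch of the idea; the hashing scheme here is illustrative, not any particular vendor's.

```typescript
import { createHash } from "node:crypto";

// Deterministic percentage rollout: hash (flag, user) into a bucket 0-99.
// The same user always lands in the same bucket, so their experience stays
// stable as you raise the percentage from 1 -> 10 -> 100.
function isEnabledForUser(flag: string, userId: string, rolloutPercent: number): boolean {
  const digest = createHash("sha256").update(`${flag}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0..99
  return bucket < rolloutPercent;
}

console.log(isEnabledForUser("new_checkout_flow", "user-42", 1));  // in the 1%?
console.log(isEnabledForUser("new_checkout_flow", "user-42", 50)); // deterministic either way
```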
Why it's powerful:
You can deploy code constantly without releasing it.
You can release to specific user segments (premium users only, internal employees, country X).
You can A/B test (some users see new flow, others see old, compare conversion).
Kill switches: if something goes wrong, flip a single switch in a UI without touching deploys.
The downsides:
Code complexity grows: every feature has if/else around it.
Flags accumulate. If you don't clean them up, you end up with thousands of dormant flags creating a maintenance nightmare. Most flag platforms now ship "stale flag" alerts.
Testing combinatorics: with 10 flags, you have 1024 possible app states. You can't test all of them.
Performance: every flag check is a function call, sometimes a network call. Cache them.
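For that last point, a minimal sketch of caching flag evaluations with a short TTL. flagClient here is a hypothetical remote flag service, not a specific vendor SDK; a few seconds of staleness is usually acceptable and a kill switch still propagates quickly.

```typescript
// Hypothetical remote flag client: each call might be a network round trip.
declare const flagClient: { isEnabled(flag: string, userId: string): Promise<boolean> };

// Cache flag decisions per (flag, user) for a few seconds so hot code paths
// don't pay a network call on every request.
const cache = new Map<string, { value: boolean; expiresAt: number }>();
const TTL_MS = 5_000;

async function isEnabledCached(flag: string, userId: string): Promise<boolean> {
  const key = `${flag}:${userId}`;
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value;

  const value = await flagClient.isEnabled(flag, userId);
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```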
Tools: LaunchDarkly, Split, Unleash, GrowthBook, Flagsmith, in-house systems. Most large companies eventually build their own.
When to use: always, eventually. Even small teams benefit. The discipline of "flag everything risky" pays for itself the first time you avoid a 3 AM rollback because you flipped a flag instead.
Strategy 6: Shadow / Mirror
Production traffic is duplicated. The original goes to v1 (which serves the user). A copy goes to v2 (whose response is discarded). The user never sees v2's output, but v2 sees real traffic and you can measure how it would have performed.
What it tests: performance under real traffic shapes. Does v2 handle the actual mix of requests? Does it hit unexpected DB contention? Memory leaks? Are p99 latencies acceptable?
What it does not test: user-facing behavior. Users never see v2, so UI bugs, rendering issues, broken flows are invisible.
The hard part: side effects. If v2 writes to a database, do you let it? If you do, you've corrupted data. If you don't, you're not really testing the write path. Same for sending emails, charging credit cards, calling external APIs. Most shadow setups disable side effects in v2 or route them to a sandbox.
When to use: backend services where correctness and performance under load matter more than UI. Database migrations, payment processors, ML inference services. Less common in web app deploys.
Side-by-Side Comparison
| Strategy | Downtime | Rollback | Cost | Risk | Complexity |
|---|---|---|---|---|---|
| Recreate | Yes | Slow (re-deploy) | Low | High | Trivial |
| Rolling | None | Slow (rolling back) | Low | Medium | Low |
| Blue/Green | None | Instant | 2x briefly | Low | Medium |
| Canary | None | Quick (route back) | Slightly higher | Lowest | High |
| Feature Flags | None | Instant (toggle) | Standard | Lowest | Medium |
| Shadow | None | N/A (no users on v2) | 2x backend | None to user | High |
The Database Problem
Every strategy above lets you run two app versions side-by-side. Your database can't easily run "two versions" at the same time. There is one schema, and both code versions have to work with it.
This is where most deployment failures actually live. Not in Kubernetes config, not in the load balancer. In the migration that worked in staging but broke in production because part of the fleet was still running v1 when it ran.
The cardinal rule: never deploy code and migrations at the same time. Decouple them.
Pattern: Expand-and-Contract (also called "parallel change").
You want to rename column user_email to email. Don't do this in one step.
Step 1 (expand): add the new column email. Backfill data. Both columns exist. Old code reads/writes user_email. Deploy this migration with no app changes.
Step 2: deploy app code that writes to BOTH columns and reads from email (with fallback to user_email). Now both code versions work fine (this step is sketched in the code below).
Step 3: deploy app code that only uses email. The old column is now unused.
Step 4 (contract): drop the old column. Schema is clean.
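Step 2 is the part people get wrong, so here is a minimal sketch of what the dual-write code might look like. db is a hypothetical query helper standing in for your actual client; the point is the shape, not the library.

```typescript
// Hypothetical query helper standing in for your DB client.
declare const db: { query(sql: string, params: unknown[]): Promise<{ rows: any[] }> };

// Step 2 of expand-and-contract: write BOTH columns, read the new one with a
// fallback to the old one. v1 (old column only) and v2 can now coexist safely.
async function saveEmail(userId: string, email: string): Promise<void> {
  await db.query(
    "UPDATE users SET email = $1, user_email = $1 WHERE id = $2",
    [email, userId],
  );
}

async function getEmail(userId: string): Promise<string | null> {
  const { rows } = await db.query(
    "SELECT email, user_email FROM users WHERE id = $1",
    [userId],
  );
  if (rows.length === 0) return null;
  return rows[0].email ?? rows[0].user_email; // prefer new column, fall back to old
}
```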
This takes weeks across multiple deploys. It is the only safe way. Anyone who tells you they renamed a column atomically in production is lying or about to find out.
Other safe patterns:
Always make new columns nullable or with defaults. NOT NULL on a backfilled column requires the backfill to complete first.
Avoid type changes on hot tables. Add a new column instead.
Index creation should be CONCURRENTLY in Postgres or via online schema change tools (gh-ost, pt-online-schema-change) in MySQL.
Foreign keys: add as NOT VALID first, then validate in a later step.
Health Checks: The Foundation of Automated Deploys
Whichever strategy you pick, the orchestrator needs to know if the new version is healthy. Otherwise it will happily route traffic to a broken pod and your users will see errors.
Three levels of health check:
Liveness probe: "Am I alive?" If no, restart me. Used to detect deadlocks, infinite loops, stuck threads. A liveness failure means the process is in a bad state and a restart might fix it.
Readiness probe: "Am I ready to serve traffic?" If no, take me out of the load balancer. Common during startup (warming caches, connecting to DB) or under load (queue full, can't accept more). A readiness failure does NOT restart the pod, just stops sending it traffic.
Startup probe: "Have I finished starting yet?" Useful for slow-starting apps (large JVMs, ML models loading). Liveness and readiness probes are suspended until startup succeeds, so a slow boot doesn't trigger restarts.
Common health check mistakes:
1. Health check that always returns 200. Useless. It needs to actually check something (DB connection, cache reachable, dependencies up).
2. Health check that fails too aggressively. If you fail readiness when one downstream is slow, you'll pull all your pods out of rotation and create a stampede.
3. Liveness checks that depend on external services. If the database goes down, your pod's liveness fails, the orchestrator restarts it, and... it still can't reach the DB. Now you've added pod restart pressure on top of an outage. Liveness should check internal state. Readiness can check externals.
4. No timeout. If your check hangs, the orchestrator will too.
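A minimal sketch of liveness and readiness endpoints that avoid the mistakes above, using plain Node. checkDatabase is a placeholder for whatever dependency check your service genuinely needs; the 2-second timeout is illustrative.

```typescript
import http from "node:http";

// Placeholder: ping whatever your app genuinely needs to serve traffic.
declare function checkDatabase(): Promise<void>;

// Liveness: internal state only. If this fails, a restart might help.
// Readiness: may check externals. If this fails, just stop sending traffic.
const server = http.createServer(async (req, res) => {
  if (req.url === "/livez") {
    // No external calls here: a dead DB should not trigger restart storms.
    res.writeHead(200).end("ok");
    return;
  }
  if (req.url === "/readyz") {
    try {
      // Bound the check so a hung dependency can't hang the probe itself.
      await Promise.race([
        checkDatabase(),
        new Promise((_, reject) => setTimeout(() => reject(new Error("timeout")), 2000)),
      ]);
      res.writeHead(200).end("ready");
    } catch {
      res.writeHead(503).end("not ready"); // pulled from rotation, not restarted
    }
    return;
  }
  res.writeHead(404).end();
});

server.listen(8080);
```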
Automated Rollback
The fastest human can react in maybe 30 seconds. By then, error rates have spiked and customers are complaining. Automated rollback is faster and doesn't require a human to be awake.
How it works in canary:
1. Define an "analysis" with metrics to monitor (error rate, latency, custom business metrics).
2. Define thresholds: "v2 error rate must be no more than 1.5x v1 error rate".
3. Run analysis at each canary stage.
4. If analysis fails, route 100% to v1 and alert humans.
Tools: Argo Rollouts (Kubernetes-native, integrates with Prometheus), Flagger (similar), Spinnaker (Netflix's platform). LaunchDarkly has automated kill switches based on metrics.
Manual rollback as a fallback. Even with automation, a human should be able to roll back with one command. kubectl rollout undo, argo rollouts abort, "flip the LB back to blue". Practice this in non-production. The middle of an incident is the wrong time to learn the rollback procedure.
Long-Lived Connections (The Drain Problem)
WebSockets, gRPC streams, server-sent events, long-polling. These are connections that stay open for hours. When you deploy, what happens to them?
Default behavior: the orchestrator sends SIGTERM, the pod stops accepting new connections, but existing ones get killed when SIGKILL hits (default 30 seconds later).
Pattern: graceful drain.
1. Pod receives SIGTERM.
2. Pod sets readiness probe to fail. Load balancer stops sending new connections.
3. Pod sends "please reconnect" signal to existing clients (close frame, redirect, etc.).
4. Pod waits for in-flight requests to finish (typical: 30s to 5min).
5. Pod exits.
Configure the orchestrator's terminationGracePeriodSeconds long enough for this drain. Otherwise SIGKILL will cut you off mid-request.
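A minimal sketch of that drain sequence in a Node service, tying into the readiness endpoint idea from earlier. The numbers are illustrative, and step 3 (asking long-lived clients to reconnect) is protocol-specific and omitted here.

```typescript
import http from "node:http";

let shuttingDown = false;

const server = http.createServer((req, res) => {
  if (req.url === "/readyz") {
    // Step 2: once we are draining, fail readiness so the load balancer
    // stops sending new connections to this pod.
    res.writeHead(shuttingDown ? 503 : 200).end();
    return;
  }
  res.writeHead(200).end("hello");
});

server.listen(8080);

process.on("SIGTERM", () => {
  shuttingDown = true;                  // step 2: start failing readiness
  server.close(() => process.exit(0));  // steps 4-5: let in-flight requests finish, then exit
  // Safety valve: exit before the orchestrator's SIGKILL arrives.
  // Keep this below terminationGracePeriodSeconds.
  setTimeout(() => process.exit(1), 25_000).unref();
});
```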
Stateful Workloads
Stateless apps deploy easily because they're interchangeable. State complicates everything.
Sticky sessions in memory: if user X's session is on pod A, and pod A is being replaced, where does the session go? Either you put sessions in shared storage (Redis, DB) or you accept session loss on deploy. Most teams pick shared storage.
Local caches: after a rolling deploy, every new pod starts with cold cache. Latency spikes until they fill. Mitigate with cache warming on startup, or longer rollouts so not all pods are cold at once.
Databases and stateful systems: you don't roll these like apps. Use StatefulSets in Kubernetes (ordered, named pods with stable storage). Updates are sequential, not parallel. Often involve leader election (only one primary at a time).
Choosing a Strategy
If you're a small team / single service / not on K8s yet: rolling. It's good enough. Set up health checks. Practice rollback.
If you ship daily and uptime matters: rolling + feature flags. Deploy code constantly behind flags. Release deliberately by flipping flags.
If incidents are board-level events: blue/green. The 2x cost is worth knowing rollback is one second. Common in finance, healthcare.
If you ship many times per day at scale: canary + feature flags + automated metric analysis. Argo Rollouts on Kubernetes is the modern default. This is what Netflix, Uber, Airbnb run.
If you're testing high-stakes backend changes: shadow on top of one of the above. Use it to validate v2 against real traffic before shifting any users.
Edge Cases and Operational Concerns
Mixed-version chaos: during any non-instantaneous strategy, two versions run together. APIs must be backwards-compatible. Message formats (Kafka, queues) must be backwards-compatible. Database schemas must be backwards-compatible. If you skip this, you'll get bizarre intermittent bugs that only happen during deploys.
The retry storm: if v2 starts erroring and clients retry, you can hammer the surviving v1 pods into oblivion. Rate limits, circuit breakers, exponential backoff, jitter. They feel like overhead until they save you.
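Client-side, the usual defense is exponential backoff with jitter and a cap on attempts. A minimal sketch; the numbers and the idea of retrying only on 5xx are placeholders you would tune.

```typescript
// Exponential backoff with full jitter: retries spread out over time instead
// of hammering the surviving pods in lockstep.
async function fetchWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetch(url);
      if (res.status < 500) return res; // only retry server errors
    } catch {
      // network error: fall through and retry
    }
    const baseMs = 200 * 2 ** attempt;        // 200, 400, 800, 1600, ...
    const jitterMs = Math.random() * baseMs;  // full jitter
    await new Promise((resolve) => setTimeout(resolve, jitterMs));
  }
  throw new Error(`giving up on ${url} after ${maxAttempts} attempts`);
}
```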
Cron jobs and scheduled tasks: if a job is running during deploy, it might be killed mid-execution. Make jobs idempotent (safe to re-run) and graceful (handle SIGTERM).
External dependencies during deploy: third-party APIs, partner integrations, etc. They don't know you're deploying. If your deploy briefly breaks the integration, you'll get noisy alerts from them.
Cold starts: new pods load classes, JIT compile, fill connection pools. The first few seconds are slow. Pre-warm with synthetic traffic before they go live, or expect a small latency spike on every deploy.
Configuration drift: v1 had env var FOO=1, v2 has FOO=2. If you forgot to update the deployment manifest, v2 inherits FOO=1 and behaves oddly. GitOps (config in version control, applied automatically) helps. Manual config changes are a constant source of incidents.
Cache poisoning across versions: if v1 wrote a cache value in format A and v2 reads format B, you get errors. Either version cache keys explicitly (include the schema version in the key) or invalidate caches on deploy.
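Versioning the key is a one-line habit. A minimal sketch; the key format is just a convention, not a library feature.

```typescript
// Include the serialization format version in the cache key so v1 and v2
// never read each other's entries. Old-format entries simply age out.
const PROFILE_CACHE_VERSION = 2;

function profileCacheKey(userId: string): string {
  return `profile:v${PROFILE_CACHE_VERSION}:${userId}`;
}

// v1 wrote "profile:v1:42"; v2 reads and writes "profile:v2:42".
console.log(profileCacheKey("42")); // "profile:v2:42"
```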
Time-of-day: deploy at low traffic. 3 AM has fewer users to break, but also fewer people awake to fix things. Tuesday at 10 AM is the classic compromise: enough traffic to see problems, plenty of engineers around. Don't deploy on Friday afternoon. Don't deploy before a holiday.
Observability During Deploys
You can't roll back what you can't see. The bare minimum:
Deploy markers on your dashboards. When metrics spike, you want to immediately see "this happened right after the v2 deploy".
Per-version metrics. Tag every metric with the version. Compare v1 vs v2 in real time (a sketch follows this list).
Distributed tracing. Trace which version handled which request. Identify v2-only failures.
Logs with version tags. Same idea. Filter by version when investigating.
Synthetic checks. Outside-in monitoring (homepage loads, login works) that catches issues regardless of internal metrics.
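A minimal sketch of the per-version tagging idea. The version is assumed to be injected at build or deploy time, and emitMetric is a hypothetical stand-in for your real metrics client (StatsD, Prometheus, OpenTelemetry, whatever you run).

```typescript
// Hypothetical metric emitter standing in for your real metrics client.
declare function emitMetric(name: string, value: number, tags: Record<string, string>): void;

// Version injected at build/deploy time, e.g. the git SHA or image tag.
const APP_VERSION = process.env.APP_VERSION ?? "unknown";

function recordRequest(route: string, statusCode: number, durationMs: number): void {
  const tags = { version: APP_VERSION, route, status: String(statusCode) };
  emitMetric("http.requests", 1, tags);
  emitMetric("http.request_duration_ms", durationMs, tags);
}

// Dashboards can now split every panel by the "version" tag and compare
// v1 vs v2 during a rollout.
```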
Deploy Frequency vs Strategy Investment
If you deploy once a quarter, the deploy is a big event. You can afford manual checks, change advisory boards, day-long maintenance windows. Recreate or basic rolling is fine.
If you deploy 100 times a day, every deploy must be cheap, fast, and automatic. Canary with automated metric analysis. Feature flags for risk management. Rollback in seconds. Investment in tooling pays back immediately.
The pattern: the more you deploy, the safer each deploy needs to be. Counterintuitive but true. High-frequency deploys force you to build the safety. Low-frequency deploys let you skip it, until the day you really need it.
Common Anti-Patterns
"Big bang" releases: three months of work, deployed all at once. Maximum risk, maximum stress, maximum chance of rollback. Break work into small deploys behind flags.
Deploys that require human steps: "first run this script, then deploy, then run this other thing". Humans forget steps. Automate or eliminate.
"It works in staging, ship it": staging never has the traffic shape, data shape, or scale of production. Canary catches what staging missed.
Skipping rollback drills: you don't know if rollback works until you've done it under pressure. Practice quarterly.
Treating deploys as a one-way door: "we shipped, we're committed". No. Every deploy should be reversible cheaply for at least an hour after.
Mixing infrastructure and application changes: "deploy v2 AND increase pod count AND change DB params". When something breaks, which change caused it? One change per deploy.
The One Thing to Remember
Pick the simplest strategy that meets your needs, not the fanciest one you've heard of. Most teams should run rolling deploys with feature flags and good health checks. That covers 95% of cases. Blue/green for environments where instant rollback is worth 2x cost. Canary plus feature flags for high-frequency, high-stakes systems. The mechanism matters less than the discipline: backwards-compatible schema changes, automated rollback on failed health checks, observability per version, and the courage to roll back fast when something looks wrong instead of trying to fix forward at 3 AM. The hardest part of deployment isn't the deploy. It's making every change small enough, reversible enough, and observable enough that the deploy itself becomes boring.