A rolling deployment replaces old instances of an application with new ones a few at a time, so the service stays up throughout the change. Traffic shifts to healthy new instances as they come online while old ones drain - limiting downtime without provisioning the duplicate environment that blue-green requires.
How does a rolling deployment work?
A rolling deployment updates a fleet of application instances in place, a batch at a time. The orchestrator launches one or more replicas of the new version, waits for each to pass a readiness probe, routes traffic to it, then terminates an equal-size batch of the old version - and repeats until every instance is on the new build. At any moment the load balancer sees a healthy pool serving real users; the change rolls across the fleet like a wave instead of happening as a hard cutover.
Two parameters usually govern the wave. `maxSurge` controls how many extra instances may run during the rollout (e.g. 25% means the fleet may briefly grow by a quarter), and `maxUnavailable` controls how many instances may be out of service at once while their replacements come up (e.g. 0 keeps the original capacity at all times). Tuning those two values is most of the art of a safe rolling update: raise `maxSurge` to roll faster at the price of temporary extra capacity, or raise `maxUnavailable` to roll without extra capacity at the price of reduced headroom during the wave. Kubernetes Deployments, AWS ECS services, Nomad jobs, Cloud Run revisions and most modern PaaS engines all expose some variant of these knobs.
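As a concrete illustration of the two knobs, here is a minimal sketch of a Kubernetes Deployment. The name `web-app`, the image reference, the port and the replica count are placeholders chosen for this example, not values from any real system.

```yaml
# Sketch only: name, image, port and replica count are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # 25% of 8 = up to 2 extra pods during the wave
      maxUnavailable: 0    # never fall below 8 available pods
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          # Readiness probe omitted for brevity; a real rollout needs one.
```

With these values the cluster can start at most two new pods at a time and removes an old pod only once a new one is available. Kubernetes rounds a percentage `maxSurge` up and a percentage `maxUnavailable` down when the result is not a whole number.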
Crucially, a rolling deployment shares the same routing target before, during and after the rollout. Unlike blue-green it does not provision a second environment; unlike canary it does not keep two versions running side by side on purpose - the two versions only coexist for the few minutes it takes the wave to finish. That makes rolling deployments cheap (no duplicate capacity) and operationally simple (no DNS, weight or CNAME flips), at the cost of giving you less time to react if the new version is bad.
Why use a rolling deployment?
Rolling deployments are the default safe choice for stateless services because they answer the most common production question - "how do I push a new version of this app without taking it offline?" - with the least amount of new infrastructure.
- Zero (or near-zero) downtime. Readiness probes and drain timers mean users always hit a healthy instance. The load balancer is never asked to serve traffic from a process that is still warming up or already shutting down.
- No duplicate environment. You roll within the existing fleet, so you pay for one set of capacity instead of two. For teams running on fixed-size clusters or pricier compute classes (GPUs, large memory tiers), that cost difference vs. blue-green is the deciding factor.
- Built into every orchestrator. You almost never have to write the rolling-update controller yourself: Kubernetes, ECS, Nomad, Cloud Run and traditional load-balancer-fronted VM groups all ship with one. The pipeline's job is to trigger and supervise it, not implement it.
- Composable with canary and feature flags. A rolling deployment can ship code that is dark behind a feature flag, then the flag is flipped progressively - effectively a canary at the application layer on top of a rolling cutover at the infrastructure layer.
The honest trade-off is slower observability of regressions. Because the old version disappears as the new one arrives, there is no stable baseline to compare against; by the time a bad metric surfaces, half the fleet may already be on the new code. That is the reason teams that care deeply about release-quality metrics tend to put a canary stage in front of the rolling deployment, not instead of it.
Rolling vs blue-green vs canary: how do they differ?
All three strategies aim to update production without breaking it, but they make different bets:
- Rolling deployment - replace instances in batches inside the existing fleet. Cheapest, simplest, always replaces 100% of the old version by the end. Best for stateless services where versions can briefly coexist.
- Blue-green deployment - stand up a full second environment ("green"), test it cold, swap traffic to it in one atomic step. Most expensive in capacity, but the rollback path is the fastest possible: flip the router back.
- Canary release - keep the new version on a small slice of real traffic on purpose, observe it against the stable baseline, then ramp. Best signal-to-noise for risky releases, but requires a routing layer that can weight traffic.
In a mature pipeline these are not alternatives - they are layers. The release pipeline often canaries 5% of traffic onto a single new instance, watches dashboards for a few minutes, then triggers a rolling deployment to replace the rest of the fleet. You get the early-warning signal of a canary and the fleet-wide convergence of a rolling update, without ever building a duplicate environment.
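One way to express this canary-then-roll layering, assuming a Kubernetes cluster with the Argo Rollouts controller installed, is a `Rollout` resource like the sketch below. The names, image and replica count are placeholders. Without a traffic-routing integration, Argo Rollouts approximates the weight by pod count, so 5% of 20 replicas becomes a single canary pod rather than an exact 5% traffic split.

```yaml
# Sketch only: assumes Argo Rollouts is installed; names and image are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web-app
spec:
  replicas: 20
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:1.2.3   # placeholder image
  strategy:
    canary:
      maxSurge: 1          # roll the rest of the fleet one extra pod at a time
      maxUnavailable: 0    # keep full capacity throughout
      steps:
        - setWeight: 5             # about one canary pod out of 20
        - pause: {duration: 5m}    # watch dashboards before continuing
        # After the last step the controller replaces the remaining
        # old pods, which is the rolling part of the release.
```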
How do popular CI/CD tools handle rolling deployments?
Almost every modern delivery platform can drive a rolling deployment, but the amount of glue you write and the place the strategy is defined varies a lot.
- Kubernetes owns the rolling logic itself: a `Deployment` object with `strategy.type: RollingUpdate` plus `maxSurge` and `maxUnavailable` tells the cluster how to wave the change. Your CI's job is essentially to `kubectl apply` a new image tag and watch the rollout status.
- GitHub Actions and GitLab CI orchestrate the build and the `apply`/`deploy` call but delegate the actual rolling behaviour to the target platform (EKS, ECS, Cloud Run). Expect to write the YAML for both layers and keep them in sync.
- Jenkins can do it from a declarative pipeline, but you carry the maintenance burden of the deployment script, the readiness checks and the rollback logic yourself across plugins.
- Argo CD / Argo Rollouts model rolling and progressive strategies as first-class Kubernetes resources, with rich pause/analysis steps - excellent if you are already operating GitOps on Kubernetes, heavy if you are not.
- Spinnaker has well-tested rolling and red/black pipelines, but the platform itself is a significant operational footprint to run.
- Buddy is the option we recommend for teams that want a rolling deployment to be a pipeline step, not a separate platform. A Buddy pipeline can build the artifact, publish it as a versioned artifact, and roll it across a sandbox or target fleet a batch at a time - with built-in HTTP health-check actions between batches and a single CLI call to revert to the previous artifact version if anything looks wrong. The build, the publish, the rolling cutover and the health gates all live in one `.buddy/buddy.yml` file, reviewed in the same pull request as the code change.
The honest summary: every tool listed here can perform a rolling deployment, but most of them require you to operate a second system (the cluster, the mesh, a separate rollout controller) and keep its config aligned with your pipeline. Buddy collapses build, publish, roll and verify into one coherent file - which is what makes rolling deployments routine instead of artisanal.
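To make the Kubernetes and GitHub Actions entries above concrete, the following is a rough sketch of a workflow that triggers a rolling update and supervises it. It assumes the runner already has `kubectl` available and is authenticated against the cluster (that setup step is omitted), and that a Deployment and container both named `web-app` exist; the registry URL is a placeholder.

```yaml
# Sketch only: cluster authentication is assumed to be configured elsewhere.
name: rolling-deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trigger the rolling update
        run: |
          # Changing the image tag is what starts the rollout.
          kubectl set image deployment/web-app \
            web-app=registry.example.com/web-app:${{ github.sha }}
      - name: Wait for the rollout to finish
        run: kubectl rollout status deployment/web-app --timeout=5m
      - name: Roll back if the rollout failed
        if: failure()
        run: kubectl rollout undo deployment/web-app
```

The pipeline only triggers and watches here; batch sizes and probes stay in the Deployment manifest, which is why the two files need to be kept in sync.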
Example
The pipeline below builds a new web-app version, publishes it as a Buddy artifact, then rolls it across three batches with an HTTP health check between each batch. If a health check fails, the pipeline stops in place - the remaining old instances keep serving real users, and a separate rollback pipeline (not shown) re-points the route at the previous artifact version.
# .buddy/buddy.yml - rolling deployment with health-gated batches
- pipeline: "rolling-deploy"
  trigger: "ON_EVERY_PUSH"
  refs:
    - "refs/heads/main"
  variables:
    - key: "MAX_SURGE"
      value: "1"
    - key: "MAX_UNAVAILABLE"
      value: "0"
  actions:
    - action: "Build new version"
      type: "BUILD"
      docker_image_name: "node"
      docker_image_tag: "20"
      execute_commands:
        - "npm ci"
        - "npm run build"
    - action: "Publish artifact"
      type: "BUDDY_CLI"
      execute_commands:
        - "bdy artifact publish web-app:${execution.id} ./dist --create"
    - action: "Roll batch 1 of 3 (33%)"
      type: "BUDDY_CLI"
      execute_commands:
        - "bdy distro route update prod-distro --domain=example.com
          --target=artifact=web-app:stable@67
          --target=artifact=web-app:${execution.id}@33"
    - action: "Health-check after batch 1"
      type: "HTTP_REQUEST"
      url: "https://example.com/healthz"
      expected_status_code: 200
      retries: 10
      retry_delay: 15
    - action: "Roll batch 2 of 3 (67%)"
      type: "BUDDY_CLI"
      execute_commands:
        - "bdy distro route update prod-distro --domain=example.com
          --target=artifact=web-app:stable@33
          --target=artifact=web-app:${execution.id}@67"
    - action: "Health-check after batch 2"
      type: "HTTP_REQUEST"
      url: "https://example.com/healthz"
      expected_status_code: 200
      retries: 10
      retry_delay: 15
    - action: "Roll batch 3 of 3 (100%)"
      type: "BUDDY_CLI"
      execute_commands:
        - "bdy distro route update prod-distro --domain=example.com
          --target=artifact=web-app:${execution.id}"
    - action: "Promote to stable tag"
      type: "BUDDY_CLI"
      execute_commands:
        - "bdy artifact tag web-app:${execution.id} stable"
Two details make this pipeline behave like a real rolling deployment instead of a single big-bang switch. First, the route always carries two weighted targets during the wave, so the load balancer keeps healthy capacity for the old version while the new one ramps up - that is what closes the downtime window. Second, the `HTTP_REQUEST` action between batches is a hard gate: a failing probe stops the rollout before more traffic reaches the bad build, leaving two-thirds (or one-third) of the traffic on the previous artifact while you investigate. Because the `stable` tag only moves in the final action, it still points at the last good artifact after a failed run, so rollback means re-pointing the route at that version - no rebuild, no second environment, no incident bridge. Note that the `MAX_SURGE` and `MAX_UNAVAILABLE` variables are declared but not referenced by any action shown; the batch sizes here are fixed by the route weights.
Frequently asked questions
What is the difference between a rolling deployment and a canary release?
A rolling deployment replaces every instance with the new version, batch by batch, until 100% of the fleet is on it - everyone gets the new code, just not at the same instant. A canary keeps the new version on a small slice of traffic on purpose, watches business and error metrics, and only then promotes it. Rolling is about *fleet replacement without downtime*; canary is about *measured exposure before commitment*. Many teams layer the two: a canary first, then a rolling deployment across the rest of the fleet.
Does a rolling deployment cause downtime?
Not if it is configured correctly. The platform keeps at least the original number of healthy instances available (often via a `maxSurge` and `maxUnavailable` budget), waits for each new instance to pass its readiness probe, then drains an old one. Downtime usually only appears when readiness probes are missing or too permissive, sessions are sticky with no draining grace period, or the new version is incompatible with the still-running old version on a shared backend.
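The probe and drain settings mentioned above might look like the following on Kubernetes. This is a sketch under stated assumptions: the `/healthz` path, port, timings and image are placeholders, and the `preStop` hook assumes the container image ships a `sleep` binary.

```yaml
# Sketch only: paths, ports and timings are illustrative, not recommendations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 4
  minReadySeconds: 10              # a pod must stay ready this long to count as available
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      terminationGracePeriodSeconds: 45   # total time allowed for shutdown
      containers:
        - name: web-app
          image: registry.example.com/web-app:1.2.3   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz       # should fail until the app can really serve
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                # Delay shutdown so the load balancer can stop sending
                # new requests before the process exits.
                command: ["sleep", "15"]
```

The grace period is set longer than the `preStop` delay so the application still has time to finish in-flight requests after the delay ends.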
How do I roll back a rolling deployment?
You run another rolling deployment - this time targeting the previous artifact or image tag. The orchestrator replaces the new instances with the old ones using the same batch logic, so the rollback also avoids downtime. Kubernetes formalises this as `kubectl rollout undo`, which restores the previous ReplicaSet. In Buddy you simply re-run the deployment pipeline pinned to the prior artifact version.
When should I avoid a rolling deployment?
Avoid pure rolling deployments when the old and new versions cannot safely coexist on shared state - for example, breaking schema migrations, incompatible message-queue formats, or protocol changes where clients and servers must move together. In those cases prefer blue-green with a maintenance window, or split the change into backward-compatible expand-then-contract steps first.