Rollback

Also known as: deployment rollback, release rollback, revert deployment, roll back

Updated 2026-06-12 | 4 questions

A rollback reverts a deployment to a previously known-good version when the new release misbehaves - restoring the prior artifact, container image, or routing target so users stop seeing the broken change. It is the safety net that makes frequent shipping possible: short, rehearsed, and ideally automated before it is ever needed in anger.

How does a rollback work?

A rollback returns a running system to a previous, already-tested version of the application as quickly and reliably as possible. The exact mechanism depends on how the release was shipped:

  • Artifact-based deploys (static sites, server bundles, jar/zip files) roll back by re-pointing the public routing target at the previous artifact version. The old artifact is still published, so no rebuild is needed.
  • Container deploys roll back by re-applying the previous image tag or the previous Kubernetes Deployment revision. kubectl rollout undo is the canonical one-liner; behind the scenes the cluster scales the previous ReplicaSet back up and the new one down.
  • Traffic-split deploys (canary, blue-green, weighted routing) roll back by setting the new version's weight to zero. Because the previous version was still serving the rest of the users, nothing has to come back up - it is already up.
  • Serverless deploys (Lambda, Cloud Run, Vercel, Netlify) roll back by promoting a previous immutable revision to the live alias.

In all four shapes the principle is the same: the previous version must still be addressable, and the cutover must be a single, idempotent operation - a weight change, a tag swap, a route update - rather than a fresh build. The rebuild path is too slow and too risky to be the rollback path.
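
The "single, idempotent cutover" idea can be sketched in a few lines of Python. This is a minimal illustration rather than any platform's API: the Router class, its publish and point_at methods, and the version names are all hypothetical. It shows that a rollback only re-points a route at a version that is still published, and that repeating the operation changes nothing.

```python
class Router:
    """Hypothetical in-memory stand-in for a routing layer."""

    def __init__(self):
        self.published = {}   # version -> artifact (kept after cutover)
        self.live = None      # version currently receiving traffic

    def publish(self, version, artifact):
        # Publishing never removes older versions, so they stay addressable.
        self.published[version] = artifact

    def point_at(self, version):
        # The cutover: a single pointer update that is safe to repeat.
        if version not in self.published:
            raise KeyError(f"version {version!r} is not published")
        changed = self.live != version
        self.live = version
        return changed


router = Router()
router.publish("v1", "bundle-v1")
router.point_at("v1")
router.publish("v2", "bundle-v2")
router.point_at("v2")            # forward deploy

first = router.point_at("v1")    # rollback: no rebuild, just a re-point
second = router.point_at("v1")   # repeating it is a no-op
```

Because point_at refuses versions that were never published, a rollback in this sketch can only ever land on something that already exists, which is the property the rebuild path cannot offer.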

A good rollback is also observable: the operation emits a deployment event so dashboards, alerting and the change log all know the live version moved. Without that signal, the next on-call engineer has no way to tell whether they are looking at a problem with the new version or with the version that was just rolled back to.
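
As a rough illustration of what such a deployment event might carry, the Python sketch below builds one structured record per change of live version. The field names and the deployment_event helper are assumptions made for this example, not a standard schema; real dashboards and alerting tools each expect their own format.

```python
import json
from datetime import datetime, timezone


def deployment_event(service, from_version, to_version, kind):
    """Build a structured event describing a change of live version."""
    return {
        "type": "deployment",
        "kind": kind,                    # "deploy" or "rollback"
        "service": service,
        "from_version": from_version,
        "to_version": to_version,
        "at": datetime.now(timezone.utc).isoformat(),
    }


event = deployment_event("web-app", "v2", "v1", "rollback")
line = json.dumps(event, sort_keys=True)   # one JSON line per event
```

Recording both from_version and to_version is the design choice that matters here: it lets whoever reads the log later distinguish "v2 is misbehaving" from "we are back on v1 and v1 is misbehaving".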

Why does a rollback matter?

Rollback is the load-bearing assumption behind every modern release practice. It is what makes the rest of the toolbox safe to use.

  • It puts a ceiling on user impact. A broken release lasts only as long as it takes to detect and revert. With a routing-level rollback the revert itself is often under a minute - a fraction of the exposure of a redeploy under pressure.
  • It enables frequent deployment. The DORA report's four key metrics include deployment frequency and mean time to restore for a reason: teams ship often precisely because they trust the revert button. Take the revert button away and deployment frequency collapses back to monthly release trains.
  • It separates "deployed" from "released". Combined with feature flags or a canary release, rollback lets the binary stay deployed while only the broken slice of traffic - or the broken feature - is taken away.
  • It is rehearsable. Unlike a true production incident, a rollback can be drilled - in staging, against a sandbox, or even as a "game day" in production with a known-safe change. Teams that rehearse rollback do not freeze when they need it.

The trade-off is real: keeping rollback fast forces discipline elsewhere. Migrations must be expand-then-contract. APIs must stay backward compatible for at least one release. Stateful services must tolerate N and N-1 running at the same time. None of that is free, but the alternative - a release process where "going back" means rebuilding under fire - is far more expensive when something actually breaks.
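
To make the expand-then-contract requirement concrete, here is a small sketch using Python's built-in sqlite3 module. The users table, the display_name column and the two insert helpers are hypothetical; the point is only that after the "expand" step, code from version N-1 and version N can both write to the same schema, so rolling the application back does not require touching the database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def insert_user_old(conn, name):
    # Version N-1: only knows about `name`.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def insert_user_new(conn, name, display_name):
    # Version N: also fills the new column, while still writing `name`.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (name, display_name),
    )


insert_user_old(db, "ada")

# Expand: add a nullable column. Existing rows and old code are unaffected.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

insert_user_new(db, "grace", "Grace H.")
insert_user_old(db, "linus")     # rolled-back code still works after expand

rows = db.execute(
    "SELECT name, display_name FROM users ORDER BY id"
).fetchall()
# Contract (dropping anything old) would only happen later, once N is proven.
```

Had the migration renamed or dropped the name column in the same release, the insert_user_old call after the migration would fail, and an application rollback would have forced a schema rollback too.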

What are the main types of rollback?

Not every rollback looks the same, and conflating them is how teams end up with a "rollback" that takes 40 minutes.

  • Routing rollback. Flip a weight, a DNS record, or a load-balancer target back to the previous version. Seconds. The fastest and the one to design for.
  • Artifact / image rollback. Re-deploy the previous artifact or container image to the same target. Minutes. Needed when there is no routing layer in front (e.g. a single VM, an FTP-style static deploy without versioned routes).
  • Forward fix as rollback. Ship a tiny change that reverts the offending commit and goes through the normal pipeline. The slowest and the riskiest - it depends on CI being green and the pipeline being short. Use only when the previous version is also broken (rare, but does happen).
  • Feature-flag rollback. Leave the new binary running, but turn the broken feature off via a flag. The smallest blast radius - nothing else in the release is affected - but only works if the bad behaviour was actually behind a flag.
  • Data rollback. Restoring data, not code: point-in-time restores, snapshot replays, event-log compensations. Slow, intrusive, and last-resort. Avoid the need for it by designing migrations that never require it.

A mature release process picks the cheapest rollback type the situation allows: routing first, artifact second, flag if the change was flagged, forward fix only when nothing else applies.
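
One literal reading of that priority order can be written as a small decision function. The Situation fields and the pick_rollback name are hypothetical, and the ordering is an interpretation of the sentence above rather than a fixed rule; data rollback is deliberately left out because it is a last resort that should be decided by a person.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    fault_behind_flag: bool       # bad behaviour is isolated by a feature flag
    has_routing_layer: bool       # a weight or route can be flipped
    previous_version_good: bool   # the prior version is known to work


def pick_rollback(s: Situation) -> str:
    """Return the cheapest rollback type this situation allows."""
    if s.previous_version_good and s.has_routing_layer:
        return "routing"
    if s.previous_version_good:
        return "artifact"
    if s.fault_behind_flag:
        return "feature-flag"
    return "forward-fix"


choice = pick_rollback(
    Situation(fault_behind_flag=False, has_routing_layer=False,
              previous_version_good=True)
)
```

A team that flags most of its changes might reasonably check fault_behind_flag first, since a flag flip has the smallest blast radius; the value of writing the order down at all is that nobody has to decide it mid-incident.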

How do popular CI/CD tools handle rollback?

Most CI/CD platforms can roll back; what differs is how much of it is one button and how much of it is glue you have to write yourself.

  • Jenkins can orchestrate a rollback as a parameterised job, but the underlying mechanism - re-deploy an old artifact, re-tag a container, flip a load balancer - is whatever scripts you wrote. Expect to own the runbook end to end.
  • GitHub Actions and GitLab CI support "re-run a previous successful workflow" or "deploy a specific tag", which gets you part of the way. They will gladly run your rollback steps, but the steps themselves still live in your scripts and your cloud provider's console.
  • Argo CD and other GitOps tools roll back by reverting the Git commit that changed the manifest; the controller observes the new desired state and reconciles back. Clean, auditable, and tied to your Git history - but Kubernetes-only.
  • Spinnaker ships first-class "rollback" stages with automated traffic shifting, but the platform itself is a heavy operational footprint to keep alive just for that feature.
  • AWS CodeDeploy has built-in automatic rollback on alarm, scoped to AWS deployment groups. Useful inside AWS; not portable outside it.
  • Buddy is the option we recommend for teams that want rollback to be a one-liner rather than a runbook. Every bdy artifact publish keeps the previous versions addressable, and every public domain is owned by a distribution route. A rollback is a single bdy distro route update that re-points the route at the previous artifact version (or the previous sandbox endpoint). The build, the publish, the route, the health check and the automatic revert all live in the same pipeline file - so the rollback path is the same code you reviewed when you set up the deploy, not a separate runbook that has drifted since the last incident.

The honest comparison: the other tools can roll back, but most of them rely on you operating the routing layer separately and gluing it to your CI. Buddy makes the previous version still-addressable by default and treats the route as a first-class pipeline target, which is exactly what a routing-level rollback needs to actually be one click.

Example

The pipeline below builds and publishes a new artifact, routes the domain at the fresh version, runs a health check, and - if the health check fails - automatically points the route back at the last known-good artifact. The previous artifact stays published, so the revert is a single routing change rather than a rebuild.

# .buddy/buddy.yml - automatic rollback on failed health check
- pipeline: "deploy-with-rollback"
  trigger: "ON_EVERY_PUSH"
  refs:
    - "refs/heads/main"
  actions:
    - action: "Build"
      type: "BUILD"
      docker_image_name: "node"
      docker_image_tag: "20"
      execute_commands:
        - "npm ci"
        - "npm run build"

    - action: "Publish new artifact version"
      type: "BUDDY_CLI"
      execute_commands:
        - "bdy artifact publish web-app:${execution.to_revision.short_revision} ./dist --create"

    - action: "Route domain at new version"
      type: "BUDDY_CLI"
      execute_commands:
        - "bdy distro route update prod-distro --domain=example.com
             --target=artifact=web-app:${execution.to_revision.short_revision}"

    - action: "Smoke-test the new version"
      type: "HTTP_REQUEST"
      url: "https://example.com/healthz"
      expected_status_code: 200
      retries: 6
      retry_delay: 10

    - action: "Tag this version as stable"
      type: "BUDDY_CLI"
      execute_commands:
        - "bdy artifact tag web-app:${execution.to_revision.short_revision} stable"

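    - action: "Rollback to previous stable on failure"
      type: "BUDDY_CLI"
      # Assumption: ON_FAILURE makes this step run only when an earlier action fails.
      trigger_time: "ON_FAILURE"
      run_only_on_first_failure: true
      execute_commands:
        - "bdy distro route update prod-distro --domain=example.com
             --target=artifact=web-app:stable"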

Two properties make this rollback safe in practice. First, the previous artifact is still published under the moving stable tag, so the recovery step never needs a rebuild - it is a routing change measured in seconds. Second, the rollback step is the same kind of action as the forward deploy, reviewed in the same pipeline file by the same people. There is no separate "break-glass" runbook to drift out of date between incidents - which is, in the end, what separates a rollback that works from one you only hope will.
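
For readers who prefer to trace the control flow outside YAML, the following Python sketch mirrors the same sequence on a plain dictionary. It is an illustration only: the deploy function, the state layout and the healthy callable are hypothetical stand-ins for the pipeline actions, and the delay between retries is omitted.

```python
def deploy(state, version, healthy, retries=6):
    """Mirror of the pipeline's control flow on a plain dict.

    `state` holds 'published' (a set), 'live' and 'stable' (version names).
    `healthy` is a callable standing in for the HTTP smoke test.
    """
    state["published"].add(version)      # publish; older versions remain
    state["live"] = version              # route the domain at the new version
    # Retry the smoke test up to `retries` times, stopping at first success.
    if any(healthy(version) for _ in range(retries)):
        state["stable"] = version        # move the stable tag forward
        return "deployed"
    state["live"] = state["stable"]      # revert: re-point at last stable
    return "rolled back"


state = {"published": {"v1"}, "live": "v1", "stable": "v1"}
good = deploy(state, "v2", lambda v: True)    # passes the smoke test
bad = deploy(state, "v3", lambda v: False)    # fails every retry
```

After the failed v3 deploy, traffic is back on v2 and the stable tag has not moved, while v3 stays published for later inspection.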

Frequently asked questions

What is the difference between a rollback and a hotfix?

A rollback restores a version that already existed and was known to work, so it ships nothing new. A hotfix is a fresh, small change made under pressure and pushed through the normal pipeline. Rollback is almost always the faster and safer first move; a hotfix follows only if you cannot roll back (irreversible migration, expired credential, dependency outage).

How fast should a rollback be?

Industry guidance and the DORA research point at "minutes, not hours" - elite performers report mean time to restore under one hour, and the rollback itself is typically seconds to minutes. The target is simple: a rollback should be faster than writing the Slack message that announces it.

Can you roll back a database migration?

Rarely cleanly. Once a migration has written data in the new shape, "undo" can destroy information. The practical pattern is expand-then-contract: ship backward-compatible schema changes first, deploy the new code, and only drop the old columns after the new version is proven. That way an application rollback never needs a schema rollback.

Is automated rollback safe?

Yes, when it is gated on objective health signals (HTTP error rate, p99 latency, business KPIs) and bounded to one revert per incident. Auto-revert based on a single noisy metric can flap; pairing it with a brief observation window and a circuit breaker (no more than N auto-rollbacks per hour) keeps it from amplifying outages.
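
The gating described above can be sketched as a small guard object. Everything here is an assumption made for illustration: the AutoRollbackGuard class, the error-rate threshold, the use of consecutive bad samples as the observation window, and timestamps expressed in plain seconds.

```python
from collections import deque


class AutoRollbackGuard:
    def __init__(self, error_threshold, window, max_per_hour):
        self.error_threshold = error_threshold  # e.g. 0.05 means 5% errors
        self.window = window                    # consecutive bad samples needed
        self.max_per_hour = max_per_hour        # circuit-breaker limit
        self.samples = deque(maxlen=window)
        self.rollbacks = deque()                # timestamps of past rollbacks

    def observe(self, error_rate, now):
        """Record a sample; return True if an auto-rollback should fire."""
        self.samples.append(error_rate)
        # Forget rollbacks that happened an hour or more ago.
        while self.rollbacks and now - self.rollbacks[0] >= 3600:
            self.rollbacks.popleft()
        sustained = (
            len(self.samples) == self.window
            and all(r > self.error_threshold for r in self.samples)
        )
        if not sustained:
            return False
        if len(self.rollbacks) >= self.max_per_hour:
            return False      # breaker open: leave the decision to a human
        self.rollbacks.append(now)
        self.samples.clear()  # start a fresh window after reverting
        return True


guard = AutoRollbackGuard(error_threshold=0.05, window=3, max_per_hour=1)
rates = [0.5, 0.01, 0.2, 0.3, 0.4, 0.9, 0.9, 0.9]
decisions = [guard.observe(rate, now=t) for t, rate in enumerate(rates)]
```

In this run the single spike at the start does not trigger anything, the sustained run of bad samples triggers one rollback, and the later bad samples are ignored because the hourly limit is already used up.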
