Skip to content

[codex] Roll back failed canary health checks#192

Open
YoungJinJung wants to merge 1 commit into
codex/weighted-canary-bake-timefrom
codex/canary-health-rollback
Open

[codex] Roll back failed canary health checks#192
YoungJinJung wants to merge 1 commit into
codex/weighted-canary-bake-timefrom
codex/canary-health-rollback

Conversation

@YoungJinJung

Copy link
Copy Markdown
Contributor

Implemented

  • Added automatic rollback for canary health check failures and health check timeouts.
  • Restores the listener default action back to the original target group before returning the health failure.
  • Detaches and deletes the canary target group created for the failed rollout.
  • Scales the failed canary Auto Scaling group to 0, waits for its instances to terminate, then deletes the Auto Scaling group and launch template.
  • Stops the runner when the health check step returns an error, so failed canaries do not continue into later deployment phases.
  • Tracks the active canary Auto Scaling group via AsgNames instead of using LatestAsg, which represents the stable group during the initial canary deployment.
  • Added English and Korean documentation for health check rollback behavior and its scope.

Tested

  • env -u GOROOT go test ./pkg/deployer -run TestShouldRollbackCanary
  • env -u GOROOT go test ./pkg/deployer ./pkg/runner
  • env -u GOROOT go test -count=1 ./pkg/... ./cmd/... ./hack/...
  • env -u GOROOT make linters
  • git diff --check
  • Live AWS rollback test in ap-northeast-2 using demoapp-xyzdapne2-ext:
    • Deployed an intentionally unhealthy canary userdata script.
    • Confirmed target health stayed unhealthy.
    • Confirmed the deploy command returned GOPLOYER_EXIT:1.
    • Confirmed the listener was restored to the stable target group.
    • Confirmed stable ASG demoapp-xyzd_apnortheast2-v001 remained 1/1/1 and healthy.
    • Confirmed canary target group demoapp-xyzd-canary-v001 was deleted.
    • Confirmed canary v002 launch templates were removed.

Note: make linters failed once with the local GOROOT pointing at a gvm Go install whose export data did not match the active Go tool version. Re-running with env -u GOROOT passed.

Remaining

  • CloudWatch alarm or synthetic-test based post-deploy rollback is still not implemented. This PR only rolls back failures observed by Goployer's health check step before the deploy command exits.
  • Listener rule based canary routing remains outside this PR and the stacked weighted target group PR.

Health check failures previously returned from the canary deployer only as a local error path, while the runner did not stop the deployment after health errors. The canary rollback now restores listener traffic, removes canary-only target group resources, scales the failed canary ASG down, waits for termination, and deletes the ASG and launch template before returning the original health failure.

The live rollback test also exposed that the stable ASG is stored in LatestAsg while the newly created canary ASG is stored in AsgNames. Rollback and canary target group detach now select the active canary ASG first to avoid touching the stable group.

Constraint: Canary rollback must reuse the existing listener default action and target group cleanup helpers from the weighted target group implementation

Rejected: Leave failed canary ASGs scaled to zero | operators would still need manual cleanup after every failed canary

Rejected: Roll back using LatestAsg only | live AWS test showed this targets the stable ASG during initial canary deploy

Confidence: high

Scope-risk: moderate

Directive: Do not replace AsgNames with LatestAsg in rollback paths without re-running an unhealthy canary against a real or mocked ASG/TG lifecycle

Tested: env -u GOROOT go test -count=1 ./pkg/... ./cmd/... ./hack/...

Tested: env -u GOROOT make linters

Tested: AWS ap-northeast-2 unhealthy canary deploy returned GOPLOYER_EXIT:1, restored listener to stable TG, left v001 1/1/1 healthy, deleted canary TG, and removed v002 launch templates

Not-tested: CloudWatch alarm based post-deploy rollback, which is not implemented by this change
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8e609318-9a2f-4c78-96c4-feba33769d1b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/canary-health-rollback

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@amazon-q-developer amazon-q-developer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR implements automatic rollback for canary health check failures, which is a valuable reliability improvement. The implementation includes proper error handling, step status management, and comprehensive test coverage.

Key Changes

  • Automatic rollback on health check failure or timeout
  • Restores listener to stable target group
  • Cleans up failed canary resources (ASG, target group, launch template)
  • Prevents failed canaries from being promoted in later phases
  • Stops runner execution on health check errors

Findings

1 issue identified requiring attention:

The rollback timeout handling reuses the same config.Timeout as the initial health check. If a health check times out, the rollback starts with zero remaining time budget, which could prevent proper cleanup of AWS resources.

The implementation is otherwise well-structured with good error handling and test coverage. Once the timeout concern is addressed, this will be a solid addition.


You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.

Comment thread pkg/deployer/canary.go
Comment on lines +193 to +223
rollbackStartTimestamp := time.Now().Unix()

for _, region := range c.Stack.Regions {
if config.Region != "" && config.Region != region.Region {
c.Logger.Debug("This region is skipped by user : " + region.Region)
continue
}

if c.OriginalTargetGroupArn[region.Region] != "" {
if err := c.ModifyListenerToStableTargetGroup(region, 100, 0); err != nil {
recordErr(err)
}
}

canaryASG := c.CanaryAutoScalingGroupName(region.Region)
if err := c.DetachLatestCanaryResources(schemas.Config{Region: region.Region}); err != nil {
recordErr(err)
}

if canaryASG != "" {
client, err := selectClientFromList(c.AWSClients, region.Region)
if err != nil {
recordErr(err)
} else if err := c.ResizingAutoScalingGroupCount(client, canaryASG, 0); err != nil {
recordErr(err)
} else {
for {
isTimeout, _ := tool.CheckTimeout(rollbackStartTimestamp, config.Timeout)
if isTimeout {
recordErr(fmt.Errorf("timeout has been exceeded : %.0f minutes", config.Timeout.Minutes()))
break

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rollback logic uses config.Timeout which is shared with the initial health check. If the health check times out, the rollback process starts with zero remaining time, causing it to immediately timeout at line 221. Consider tracking rollback time separately or reserving timeout capacity for rollback operations to ensure cleanup completes.

@YoungJinJung YoungJinJung marked this pull request as ready for review June 27, 2026 05:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant