[codex] Roll back failed canary health checks#192
Conversation
Health check failures previously returned from the canary deployer only as a local error path, while the runner did not stop the deployment after health errors. The canary rollback now restores listener traffic, removes canary-only target group resources, scales the failed canary ASG down, waits for termination, and deletes the ASG and launch template before returning the original health failure. The live rollback test also exposed that the stable ASG is stored in LatestAsg while the newly created canary ASG is stored in AsgNames. Rollback and canary target group detach now select the active canary ASG first to avoid touching the stable group. Constraint: Canary rollback must reuse the existing listener default action and target group cleanup helpers from the weighted target group implementation Rejected: Leave failed canary ASGs scaled to zero | operators would still need manual cleanup after every failed canary Rejected: Roll back using LatestAsg only | live AWS test showed this targets the stable ASG during initial canary deploy Confidence: high Scope-risk: moderate Directive: Do not replace AsgNames with LatestAsg in rollback paths without re-running an unhealthy canary against a real or mocked ASG/TG lifecycle Tested: env -u GOROOT go test -count=1 ./pkg/... ./cmd/... ./hack/... Tested: env -u GOROOT make linters Tested: AWS ap-northeast-2 unhealthy canary deploy returned GOPLOYER_EXIT:1, restored listener to stable TG, left v001 1/1/1 healthy, deleted canary TG, and removed v002 launch templates Not-tested: CloudWatch alarm based post-deploy rollback, which is not implemented by this change
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Summary
This PR implements automatic rollback for canary health check failures, which is a valuable reliability improvement. The implementation includes proper error handling, step status management, and comprehensive test coverage.
Key Changes
- Automatic rollback on health check failure or timeout
- Restores listener to stable target group
- Cleans up failed canary resources (ASG, target group, launch template)
- Prevents failed canaries from being promoted in later phases
- Stops runner execution on health check errors
Findings
1 issue identified requiring attention:
The rollback timeout handling reuses the same config.Timeout as the initial health check. If a health check times out, the rollback starts with zero remaining time budget, which could prevent proper cleanup of AWS resources.
The implementation is otherwise well-structured with good error handling and test coverage. Once the timeout concern is addressed, this will be a solid addition.
You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.
| rollbackStartTimestamp := time.Now().Unix() | ||
|
|
||
| for _, region := range c.Stack.Regions { | ||
| if config.Region != "" && config.Region != region.Region { | ||
| c.Logger.Debug("This region is skipped by user : " + region.Region) | ||
| continue | ||
| } | ||
|
|
||
| if c.OriginalTargetGroupArn[region.Region] != "" { | ||
| if err := c.ModifyListenerToStableTargetGroup(region, 100, 0); err != nil { | ||
| recordErr(err) | ||
| } | ||
| } | ||
|
|
||
| canaryASG := c.CanaryAutoScalingGroupName(region.Region) | ||
| if err := c.DetachLatestCanaryResources(schemas.Config{Region: region.Region}); err != nil { | ||
| recordErr(err) | ||
| } | ||
|
|
||
| if canaryASG != "" { | ||
| client, err := selectClientFromList(c.AWSClients, region.Region) | ||
| if err != nil { | ||
| recordErr(err) | ||
| } else if err := c.ResizingAutoScalingGroupCount(client, canaryASG, 0); err != nil { | ||
| recordErr(err) | ||
| } else { | ||
| for { | ||
| isTimeout, _ := tool.CheckTimeout(rollbackStartTimestamp, config.Timeout) | ||
| if isTimeout { | ||
| recordErr(fmt.Errorf("timeout has been exceeded : %.0f minutes", config.Timeout.Minutes())) | ||
| break |
There was a problem hiding this comment.
The rollback logic uses config.Timeout which is shared with the initial health check. If the health check times out, the rollback process starts with zero remaining time, causing it to immediately timeout at line 221. Consider tracking rollback time separately or reserving timeout capacity for rollback operations to ensure cleanup completes.
Implemented
0, waits for its instances to terminate, then deletes the Auto Scaling group and launch template.AsgNamesinstead of usingLatestAsg, which represents the stable group during the initial canary deployment.Tested
env -u GOROOT go test ./pkg/deployer -run TestShouldRollbackCanaryenv -u GOROOT go test ./pkg/deployer ./pkg/runnerenv -u GOROOT go test -count=1 ./pkg/... ./cmd/... ./hack/...env -u GOROOT make lintersgit diff --checkap-northeast-2usingdemoapp-xyzdapne2-ext:unhealthy.GOPLOYER_EXIT:1.demoapp-xyzd_apnortheast2-v001remained1/1/1and healthy.demoapp-xyzd-canary-v001was deleted.Note:
make lintersfailed once with the localGOROOTpointing at a gvm Go install whose export data did not match the active Go tool version. Re-running withenv -u GOROOTpassed.Remaining