Re-fix #3976 (zombie AppSync subscriptions on same-network TCP route swap) without regressing #4220 #4229

Description

@harsh62

Background

PR #4202 fixed #3976 (AppSync subscriptions disconnecting after 2-3 minutes, or lingering as zombie subscriptions, after iOS recycles the TCP route on the same network). However, #4202 was found to cause #4220: GraphQL subscriptions get stuck in the connecting state after repeated iOS background/foreground transitions.

@akiramur confirmed v2.56.0 (pre-#4202) does not reproduce #4220. The regression was traced directly to the WebSocket recycle path introduced in #4202.

PR #4228 reverts #4202 to resolve the #4220 regression. Because that revert also removes the #3976 fix, #3976 will reappear and must be re-addressed here with a non-regressing approach.

What #4202 did (and why it regressed)

Two layered changes:

1. WebSocketClient.onNetworkStateChange — new (.online, .online) recycle case

case (.online, .online):
    // A second .satisfied emission while already online ⇒ underlying path
    // was swapped; existing URLSessionWebSocketTask is bound to a stale route.
    guard connection?.state == .running else { break }
    connection?.cancel(with: .invalid, reason: nil)
    subject.send(.disconnected(.invalid, nil))
    await createConnectionAndRead()

2. AppSyncRealTimeSubscription.prepareForResubscribe() + call in resumeExistingSubscriptions()

func prepareForResubscribe() {
    state.send(.none)   // reset local state so subscribe() re-sends .start after reconnect
}
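To illustrate why the state reset matters, here is a minimal sketch. It assumes subscribe() skips sending a start message whenever the local state is not .none; that is an inference from the comment above, not confirmed behaviour. SubscriptionModel and its fields are hypothetical and are not the Amplify types.

```swift
// Hypothetical simplification; the real AppSyncRealTimeSubscription tracks more states.
enum SubscriptionState { case none, subscribing, subscribed }

struct SubscriptionModel {
    private(set) var state: SubscriptionState = .none
    private(set) var startMessagesSent = 0

    // Sends a start message only when no subscription is believed to be active (assumption).
    mutating func subscribe() {
        guard state == .none else { return }
        state = .subscribing
        startMessagesSent += 1
        state = .subscribed // assume the server acknowledges immediately, for brevity
    }

    // Mirrors the intent of prepareForResubscribe(): forget the old subscription
    // so the next subscribe() sends a fresh start message on the new socket.
    mutating func prepareForResubscribe() {
        state = .none
    }
}
```

Without the reset, a subscribe() call after the reconnect would be a no-op in this model, which matches the zombie-subscription symptom described in #3976.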

Suspected regression mechanism: On repeated background/foreground transitions, NWPathMonitor emits additional .satisfied events that get mapped to (.online, .online). Each one cancels and re-establishes the connection. Under rapid lifecycle churn these recycles appear to interleave/stack, leaving the client stuck in connecting (never completing the reconnect handshake). See #4220 for the repro repo (feature/multi-subscription-repro branch).
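A rough sketch of the suspected mechanism: if every path-monitor emission is paired with the previous state, each extra .satisfied while already online lands in the (.online, .online) case. NetworkState, transitions(from:initial:) and recycleCount(for:) are hypothetical names used only for illustration, and the initial .offline state is an assumption.

```swift
// Hypothetical stand-in for the client's network state; not the Amplify type.
enum NetworkState { case online, offline }

// Pair each emission with the one before it, as a (previous, current) tuple.
func transitions(from emissions: [NetworkState],
                 initial: NetworkState = .offline) -> [(NetworkState, NetworkState)] {
    var previous = initial
    var result: [(NetworkState, NetworkState)] = []
    for state in emissions {
        result.append((previous, state))
        previous = state
    }
    return result
}

// Count how many transitions would hit the (.online, .online) recycle case.
func recycleCount(for emissions: [NetworkState]) -> Int {
    transitions(from: emissions)
        .filter { $0.0 == .online && $0.1 == .online }
        .count
}
```

In this model, four consecutive .online emissions produce three recycle attempts, with nothing preventing a later attempt from starting while an earlier one is still reconnecting.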

Requirements for the re-fix

Ideas to explore

  • Debounce / coalesce (.online, .online) events so lifecycle churn doesn't trigger repeated recycles.
  • Liveness probe instead of blind recycle — only recycle when the existing task is actually dead (e.g. ping/pong timeout or detected read/write failure) rather than on every duplicate .satisfied.
  • Guard against overlapping recycles — ensure a recycle in flight can't be re-triggered, and that the reconnect completes (or is cancelled cleanly) before another starts.
  • Reconcile with app-lifecycle handling so scenePhase transitions don't double-count as path changes.
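The first three ideas could be combined into one small decision object, sketched below. Everything here is hypothetical: RecyclePolicy, its method names and its thresholds are not part of Amplify. Liveness is approximated as "no inbound frame for staleAfter seconds", and the coalescing is a simple throttle rather than a trailing-edge debounce. The sketch assumes the caller serializes access (for example from within an actor) and supplies a monotonic timestamp in seconds.

```swift
/// Hypothetical policy object; none of these names exist in Amplify.
struct RecyclePolicy {
    let minimumInterval: Double   // seconds that must pass between recycles
    let staleAfter: Double        // seconds without inbound traffic before the task is treated as dead
    private(set) var inFlight = false
    private(set) var lastRecycleAt: Double?
    private(set) var lastInboundAt: Double

    init(minimumInterval: Double, staleAfter: Double, now: Double) {
        self.minimumInterval = minimumInterval
        self.staleAfter = staleAfter
        self.lastInboundAt = now
    }

    /// Record any inbound frame (data, keep-alive, or pong).
    mutating func noteInbound(now: Double) {
        lastInboundAt = now
    }

    /// Call on a duplicate .satisfied; returns true if the caller should recycle now.
    mutating func beginRecycleIfNeeded(now: Double) -> Bool {
        // Overlap guard: never start a recycle while one is in flight.
        guard !inFlight else { return false }
        // Liveness check: a connection that is still receiving frames is left alone.
        guard now - lastInboundAt >= staleAfter else { return false }
        // Coalescing: ignore triggers that arrive too soon after the last recycle.
        if let last = lastRecycleAt, now - last < minimumInterval { return false }
        inFlight = true
        lastRecycleAt = now
        return true
    }

    /// Call when the reconnect completes or is cancelled cleanly.
    mutating func finishRecycle(now: Double) {
        inFlight = false
        lastInboundAt = now // treat a fresh connection as live
    }
}
```

With something like this, the (.online, .online) case would recycle only when beginRecycleIfNeeded(now:) returns true, and would call finishRecycle(now:) once createConnectionAndRead() returns. Suitable values for the two thresholds are an open question and would need to be checked against both repro scenarios.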

References

Testing

Validate against the #4220 repro repo (multi-subscription background/foreground churn) and the #3976 scenario (idle subscription surviving a same-network route swap past the ~2-3 min mark) before merging.

Labels: api, bug
