fix: resilient worker accept loop — survive client disconnects #80

Closed
cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:fix/accept-loop-resilience

cjchanh commented Apr 11, 2026

Summary

The worker accept loop (Worker::run) silently exits on any accept() error, permanently killing the TCP listener. This causes a "one connection works, then dead worker" failure mode.

On iOS this is particularly severe: the first master disconnect (broken pipe, network hiccup, master restart) leaves the worker completely unreachable until the app is force-killed and relaunched.

Changes

  • Hold TcpListener in Option for in-place rebinding
  • Classify transient errors (ECONNABORTED, EINTR, TimedOut, WouldBlock) and retry immediately
  • On fatal accept errors, drop and rebind the listener on the same port without reloading model weights
  • Log listener fd and local address before each accept cycle for diagnostics
  • Add describe_listener() helper for cross-platform fd reporting
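
The transient-error classification in the list above could look roughly like the sketch below. The function name `is_transient_accept_error` is hypothetical and not taken from the patch; only the four error kinds come from the list, mapped to their `std::io::ErrorKind` equivalents.

```rust
use std::io::{self, ErrorKind};

/// Hypothetical helper (name is an assumption, not from the patch).
/// Returns true for accept() errors that are worth retrying on the
/// same listener; any other kind would go down the rebind path.
fn is_transient_accept_error(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::ConnectionAborted // ECONNABORTED
            | ErrorKind::Interrupted // EINTR
            | ErrorKind::TimedOut
            | ErrorKind::WouldBlock
    )
}
```

An accept loop would call this on each `Err` and `continue` when it returns true, leaving the rebind path for everything else.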

Testing

  • cargo test -p cake-core --lib — 589 passed, 0 failed
  • Live iOS worker (iPad Air M3, iPadOS 26.2): 20 sequential inference requests over 6 minutes with zero drops or reconnections needed
  • Before this fix: worker died after first master disconnect, every time

Diagnostic log lines (after fix)

worker accept loop awaiting master on 0.0.0.0:10128 fd=9
[10.x.x.x:52341] connection loop ended: broken pipe
worker accept loop awaiting master on 0.0.0.0:10128 fd=9   ← re-enters accept
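
A helper producing the `addr fd=N` suffix seen in these log lines might be sketched as follows. This is an assumption about what `describe_listener()` does, not the code from the patch: it uses `std::net::TcpListener` to stay self-contained (the worker may use an async listener instead), and the Windows and fallback branches are illustrative.

```rust
use std::net::TcpListener;

/// Hypothetical sketch of a describe_listener()-style helper.
/// Formats the local address plus the platform's raw socket handle.
fn describe_listener(listener: &TcpListener) -> String {
    let addr = listener
        .local_addr()
        .map(|a| a.to_string())
        .unwrap_or_else(|_| "<unknown>".to_string());

    // Unix-like targets (including iOS) expose a raw file descriptor.
    #[cfg(unix)]
    let handle = {
        use std::os::unix::io::AsRawFd;
        listener.as_raw_fd().to_string()
    };
    // Windows exposes a raw SOCKET handle instead of an fd.
    #[cfg(windows)]
    let handle = {
        use std::os::windows::io::AsRawSocket;
        listener.as_raw_socket().to_string()
    };
    // Assumed fallback for any other target.
    #[cfg(not(any(unix, windows)))]
    let handle = String::from("n/a");

    format!("{addr} fd={handle}")
}
```

Logging this before each accept cycle makes it easy to see whether the loop is reusing the same socket or a freshly bound one.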

Recovery path (fatal accept error):

accept failed on 0.0.0.0:10128; dropping listener and rebinding ...
worker listener bound on 0.0.0.0:10128 fd=12

Fixes #79

The worker accept loop (`Worker::run`) previously used
`while let Ok(...) = self.listener.accept()` which silently exited on
any accept error, permanently killing the TCP listener. This caused a
"one connection works, then dead worker" failure mode, particularly
severe on iOS where the first master disconnect left the worker
unreachable until a full app restart.
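
To illustrate why the `while let Ok(...)` shape is fragile, the sketch below replays a simulated sequence of accept results through both loop shapes. All names are hypothetical and the `u32` values stand in for accepted connections; this is a model of the control flow, not the worker code.

```rust
use std::io::{self, ErrorKind};

/// Old shape: the loop condition only matches Ok, so the first Err
/// ends the loop and nothing is ever accepted again.
fn serve_until_first_error<I>(accepts: I) -> usize
where
    I: IntoIterator<Item = io::Result<u32>>,
{
    let mut results = accepts.into_iter();
    let mut served = 0;
    while let Some(Ok(_conn)) = results.next() {
        served += 1;
    }
    served
}

/// Resilient shape: every result is matched, and an error is logged
/// instead of terminating the loop.
fn serve_through_errors<I>(accepts: I) -> usize
where
    I: IntoIterator<Item = io::Result<u32>>,
{
    let mut served = 0;
    for result in accepts {
        match result {
            Ok(_conn) => served += 1,
            Err(err) => eprintln!("accept failed: {err}; continuing"),
        }
    }
    served
}

/// Simulated accept results: one connection, one error, two more connections.
fn simulated_accepts() -> Vec<io::Result<u32>> {
    vec![
        Ok(1),
        Err(io::Error::from(ErrorKind::ConnectionAborted)),
        Ok(2),
        Ok(3),
    ]
}
```

With this input the first function serves one connection and stops, while the second serves all three, which matches the "one connection works, then dead worker" symptom.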

Changes:
- Hold TcpListener in Option for in-place rebinding
- Classify transient errors (ECONNABORTED, EINTR, TimedOut, WouldBlock)
  and retry immediately instead of exiting
- On fatal accept errors, drop and rebind the listener on the same port
  without reloading model weights
- Log listener fd and local address before each accept cycle for
  diagnostics
- Add describe_listener() helper for cross-platform fd reporting

Tested: 20 sequential inference requests over 6 minutes on an iOS
worker (iPad M3) with zero drops or reconnections needed.

Fixes evilsocket#79

cjchanh commented Apr 30, 2026

This fix is still relevant from my side, and I am still maintaining the branch.
It keeps the worker accept loop alive after transient disconnects without changing the intended worker setup flow.
If this is still in scope for upstream, I can respond to review feedback or adjust the patch as needed.

cjchanh commented May 1, 2026

Closing this — upstream #84 (merged 2026-04-24) addresses the iOS connection failure path under issue #79 with a different fix (retry logic + spawn_blocking). The narrow worker-loop diff here doesn't add value on top of that. Thanks for the merge of #84.

cjchanh closed this May 1, 2026

Development

Successfully merging this pull request may close these issues.

iOS worker: TCP listener accepts SYN probes but refuses full connect() from master