fix: resilient worker accept loop — survive client disconnects by cjchanh · Pull Request #80 · evilsocket/cake

cjchanh · 2026-04-11T00:09:39Z

Summary

The worker accept loop (Worker::run) silently exits on any accept() error, permanently killing the TCP listener. This causes a "one connection works, then dead worker" failure mode.

On iOS this is particularly severe: the first master disconnect (broken pipe, network hiccup, master restart) leaves the worker completely unreachable until the app is force-killed and relaunched.

Changes

- Hold TcpListener in Option for in-place rebinding
- Classify transient errors (ECONNABORTED, EINTR, TimedOut, WouldBlock) and retry immediately
- On fatal accept errors, drop and rebind the listener on the same port without reloading model weights
- Log listener fd and local address before each accept cycle for diagnostics
- Add describe_listener() helper for cross-platform fd reporting
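
The retry-and-rebind behaviour listed above can be sketched roughly as follows. This is a minimal illustration and not the code from the diff: it uses the blocking `std::net::TcpListener` for self-containment, and the `Acceptor` type and `is_transient` function are hypothetical names introduced here. The mapping of ECONNABORTED and EINTR to `ErrorKind::ConnectionAborted` and `ErrorKind::Interrupted` is how the Rust standard library reports those errno values.

```rust
use std::io::{self, ErrorKind};
use std::net::{SocketAddr, TcpListener, TcpStream};

/// Hypothetical classifier: true for accept() errors that are worth
/// retrying on the same listener.
fn is_transient(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        ErrorKind::ConnectionAborted   // ECONNABORTED
            | ErrorKind::Interrupted   // EINTR
            | ErrorKind::TimedOut
            | ErrorKind::WouldBlock
    )
}

/// Hypothetical wrapper that keeps the listener in an Option so it can be
/// dropped and rebound in place without touching any other worker state.
struct Acceptor {
    addr: SocketAddr,
    listener: Option<TcpListener>,
}

impl Acceptor {
    fn bind(addr: SocketAddr) -> io::Result<Self> {
        let listener = TcpListener::bind(addr)?;
        // Record the resolved address so a rebind reuses the same port.
        let addr = listener.local_addr()?;
        Ok(Self { addr, listener: Some(listener) })
    }

    /// Blocks until a connection is accepted. Transient errors are retried
    /// immediately; any other error drops the listener and rebinds it.
    fn accept(&mut self) -> io::Result<(TcpStream, SocketAddr)> {
        loop {
            if self.listener.is_none() {
                // A failed rebind is returned to the caller.
                self.listener = Some(TcpListener::bind(self.addr)?);
            }
            let result = self
                .listener
                .as_ref()
                .expect("listener was just ensured")
                .accept();
            match result {
                Ok(conn) => return Ok(conn),
                Err(e) if is_transient(&e) => continue,
                // Fatal: drop the socket; the next iteration rebinds.
                Err(_) => self.listener = None,
            }
        }
    }
}
```

A caller would invoke `accept()` in an outer loop and handle each connection before calling it again, so an error inside the connection handler never ends the loop. A production version would probably add a short backoff before rebinding to avoid spinning if accept keeps failing.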

Testing

cargo test -p cake-core --lib — 589 passed, 0 failed
Live iOS worker (iPad Air M3, iPadOS 26.2): 20 sequential inference requests over 6 minutes with zero drops or reconnections needed
Before this fix: worker died after first master disconnect, every time

Diagnostic log lines (after fix)

worker accept loop awaiting master on 0.0.0.0:10128 fd=9
[10.x.x.x:52341] connection loop ended: broken pipe
worker accept loop awaiting master on 0.0.0.0:10128 fd=9   ← re-enters accept

Recovery path (fatal accept error):

accept failed on 0.0.0.0:10128; dropping listener and rebinding ...
worker listener bound on 0.0.0.0:10128 fd=12
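
For illustration, a helper that produces the `<addr> fd=<n>` portion of the lines above might look like the sketch below. The name `describe_listener` comes from this PR, but the body here is an assumption and not the code in the diff; `raw_handle` is a hypothetical helper, and the blocking `std::net::TcpListener` stands in for whatever listener type the worker actually uses.

```rust
use std::net::TcpListener;

// Hypothetical helper: the OS-level handle of the listener as text.
#[cfg(unix)]
fn raw_handle(listener: &TcpListener) -> String {
    use std::os::unix::io::AsRawFd;
    listener.as_raw_fd().to_string()
}

#[cfg(windows)]
fn raw_handle(listener: &TcpListener) -> String {
    use std::os::windows::io::AsRawSocket;
    listener.as_raw_socket().to_string()
}

#[cfg(not(any(unix, windows)))]
fn raw_handle(_listener: &TcpListener) -> String {
    "n/a".to_string()
}

/// Assumed shape of the helper: local address plus raw handle,
/// e.g. "0.0.0.0:10128 fd=9" on a Unix-like system.
fn describe_listener(listener: &TcpListener) -> String {
    let addr = listener
        .local_addr()
        .map(|a| a.to_string())
        .unwrap_or_else(|_| "unknown".to_string());
    format!("{} fd={}", addr, raw_handle(listener))
}
```

Logging this string before each accept cycle makes a rebind visible in the logs, since the fd value changes when the listener is replaced.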

Fixes #79

The worker accept loop (`Worker::run`) previously used `while let Ok(...) = self.listener.accept()` which silently exited on any accept error, permanently killing the TCP listener. This caused a "one connection works, then dead worker" failure mode, particularly severe on iOS where the first master disconnect left the worker unreachable until a full app restart.

Changes:
- Hold TcpListener in Option for in-place rebinding
- Classify transient errors (ECONNABORTED, EINTR, TimedOut, WouldBlock) and retry immediately instead of exiting
- On fatal accept errors, drop and rebind the listener on the same port without reloading model weights
- Log listener fd and local address before each accept cycle for diagnostics
- Add describe_listener() helper for cross-platform fd reporting

Tested: 20 sequential inference requests over 6 minutes on an iOS worker (iPad M3) with zero drops or reconnections needed.

Fixes evilsocket#79

cjchanh · 2026-04-30T17:58:54Z

This fix is still relevant from my side, and I am still maintaining the branch.
It keeps the worker accept loop alive after transient disconnects without changing the intended worker setup flow.
If this is still in scope for upstream, I can respond to review feedback or adjust the patch as needed.

cjchanh · 2026-05-01T00:38:11Z

Closing this — upstream #84 (merged 2026-04-24) addresses the iOS connection failure path under issue #79 with a different fix (retry logic + spawn_blocking). The narrow worker-loop diff here doesn't add value on top of that. Thanks for the merge of #84.

cjchanh closed this May 1, 2026

cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:fix/accept-loop-resilience
