orchestrator: silent sidecar boot failure on first install_embeddings — 60s timeout exceeded by cold-start downloads

## Summary

On a fresh install over a residential connection, the first `install_embeddings(install)` call fails with `Sidecar failed to start. Check the logs for details.` despite a healthy host (Python 3.12.3, uv 0.11.7). A second invocation succeeds because the first attempt populated the uv and HuggingFace caches. The "check the logs" guidance is non-actionable — sidecar stdout/stderr are routed to `"ignore"`, so no log file is produced.

This is likely not reproducible on fast or datacenter connections, where the cold-start downloads finish well inside the current 60 s window.

## Environment

- Plugin version: 0.25.2
- Host: WSL2 Ubuntu 24.04, residential broadband (~100 Mbps)
- Python 3.12.3, uv 0.11.7

## Root cause

`mcp/server.ts:191-196` (`trySpawn` for `uvx --with-requirements`) sets a **60 s** timeout. On first run that path must:

1. Resolve `requirements.txt` and download wheels (`onnxruntime` ~150 MB, `tokenizers`, `numpy`, `huggingface_hub`).
2. Build a uv-managed venv.
3. Download the bge-m3 ONNX model (~2 GB) from HuggingFace under unauthenticated rate limits.
4. Load the ONNX session and bind the HTTP port.

On residential bandwidth the total exceeded 60 s, so `trySpawn` killed the process before `.sidecar-port` was written. Back-of-envelope: ~2.15 GB (the ~150 MB `onnxruntime` wheel plus the ~2 GB model) at ~100 Mbps is ~170 s of download alone, well over budget. At gigabit it is ~17 s, comfortably under.
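The estimate can be reproduced with a few lines of arithmetic. This is a rough sketch only: `downloadSeconds` is a hypothetical helper, sizes are treated as decimal gigabytes, and protocol overhead, HuggingFace rate limiting, venv creation, and model load time are all ignored, so real boots will take longer.

```typescript
// Back-of-envelope transfer time: payload size (decimal GB) over link speed (Mbps).
// Ignores protocol overhead, rate limits, venv creation, and ONNX session load.
function downloadSeconds(gigabytes: number, megabitsPerSecond: number): number {
  const megabits = gigabytes * 1000 * 8; // 1 GB = 1000 MB = 8000 Mb
  return megabits / megabitsPerSecond;
}

// Sizes taken from the step list: ~150 MB onnxruntime wheel + ~2 GB bge-m3 model.
const payloadGb = 0.15 + 2.0;
const residentialSeconds = downloadSeconds(payloadGb, 100);
const gigabitSeconds = downloadSeconds(payloadGb, 1000);

const BOOT_TIMEOUT_SECONDS = 60; // the current trySpawn window
console.log(
  `100 Mbps: ${residentialSeconds.toFixed(0)} s, ` +
    `1 Gbps: ${gigabitSeconds.toFixed(0)} s, ` +
    `budget: ${BOOT_TIMEOUT_SECONDS} s`,
);
```

Under these assumptions the 100 Mbps case lands at roughly three times the 60 s budget, while the gigabit case fits with room to spare.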

## Compounding diagnostics gap

`mcp/server.ts:109-110` sets `stdout: "ignore", stderr: "ignore"` on the spawned sidecar. Boot failures emit zero output, even though `system_status` advises "check the logs." There is no log.
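One way to close this gap is to hand the child a file descriptor instead of `"ignore"`. The sketch below is not the plugin's code: it assumes the Node-compatible `node:child_process` API (the plugin may use a different spawn API), `logStdio` is a hypothetical helper, and the child is a short-lived stand-in process rather than the real sidecar.

```typescript
import { spawnSync } from "node:child_process";
import { closeSync, openSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: open (and truncate) a log file and build a stdio tuple
// that routes the child's stdout and stderr to it instead of "ignore".
function logStdio(logPath: string): { fd: number; stdio: ["ignore", number, number] } {
  const fd = openSync(logPath, "w"); // "w" truncates, so each spawn starts a fresh log
  return { fd, stdio: ["ignore", fd, fd] };
}

// Demo only: a stand-in child that prints to stderr and exits non-zero,
// the way a failed sidecar boot might. A real sidecar would be spawned
// asynchronously; spawnSync is used here just to keep the demo short.
const logPath = join(tmpdir(), "sidecar-demo.log");
const { fd, stdio } = logStdio(logPath);
const result = spawnSync(
  process.execPath,
  ["-e", "console.error('Downloading/caching model'); process.exit(1)"],
  { stdio },
);
closeSync(fd); // the child held its own copy of the descriptor

const captured = readFileSync(logPath, "utf8");
console.log(`exit=${result.status} log=${captured.trim()}`);
```

With output captured this way, a boot that dies mid-download would leave its last log line on disk, which is enough to tell a slow download from a genuine crash.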

## Manual reproduction with visible stderr

`embed_server.py` does emit useful INFO logs — they're discarded by the spawn. Direct invocation captures them:

```bash
uvx --with-requirements <plugin-root>/sidecar/requirements.txt \
    python <plugin-root>/sidecar/embed_server.py \
    --port 0 --port-file /tmp/diag.port
```

Output: `Downloading/caching model …` → `Loading ONNX session …` → `Model ready (dim=1024) …` → `Wrote port N to …`. Any of these would have made the first failure self-diagnosing.

## Why the second call succeeded

The first failed attempt populated `~/.cache/uv/` and `~/.cache/huggingface/hub/models--BAAI--bge-m3` (~2.2 GB). Re-invoking skipped both downloads; sidecar booted in ~10 s and reported `dim=1024`. The underlying boot path is sound — it just needs more time on first run and visible diagnostics when it doesn't get it.
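A check along the following lines could tell a cold first run apart from a cached one. It is a sketch under stated assumptions: `hfModelCacheDir` and `isModelCached` are hypothetical helpers, only the `HF_HUB_CACHE` and `HF_HOME` overrides are honoured (not `XDG_CACHE_HOME`), and the presence of a `snapshots` directory does not prove the download completed.

```typescript
import { existsSync, mkdirSync, mkdtempSync } from "node:fs";
import { homedir, tmpdir } from "node:os";
import { join } from "node:path";

type Env = Record<string, string | undefined>;

// Resolve the HuggingFace hub cache directory for a repo such as "BAAI/bge-m3".
// Assumption: default layout <hub cache>/models--<org>--<name>.
function hfModelCacheDir(repoId: string, env: Env = process.env, home: string = homedir()): string {
  const hfHome = env.HF_HOME ?? join(home, ".cache", "huggingface");
  const hubCache = env.HF_HUB_CACHE ?? join(hfHome, "hub");
  return join(hubCache, `models--${repoId.replace(/\//g, "--")}`);
}

// Heuristic only: a snapshots/ directory suggests at least one download began.
function isModelCached(repoId: string, env: Env = process.env): boolean {
  return existsSync(join(hfModelCacheDir(repoId, env), "snapshots"));
}

// Demo against a throwaway cache directory so the real cache is untouched.
const demoEnv: Env = { HF_HUB_CACHE: mkdtempSync(join(tmpdir(), "hf-cache-")) };
const cachedBefore = isModelCached("BAAI/bge-m3", demoEnv);
mkdirSync(join(hfModelCacheDir("BAAI/bge-m3", demoEnv), "snapshots"), { recursive: true });
const cachedAfter = isModelCached("BAAI/bge-m3", demoEnv);
console.log(cachedBefore, cachedAfter);
```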

## Recommendations (priority order)

1. **Surface stderr.** Redirect spawned sidecar stderr to `<pluginRoot>/.sidecar.log` (truncated on each spawn) instead of `"ignore"`. Surface the path in `system_status` when sidecar status is `error`. Highest leverage — converts every future variant of this failure into a self-diagnosing one.
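2. **Raise the cold-start timeout** to at least ~180 s, or expose it via env var (`ORCH_SIDECAR_BOOT_TIMEOUT_MS`). Since the estimate above puts downloads alone at ~170 s on a ~100 Mbps link, 180 s leaves little headroom, so the env var matters for slower links. Eliminates the failure on typical residential links without affecting fast-link users.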
3. **Cached-model hint in `install_embeddings(check)`.** If the model is in the HF cache but the sidecar isn't running, mention "model already cached, expect ~10 s boot." Disambiguates first-run from broken-run for users on a second attempt.
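The timeout override in recommendation 2 could be read roughly as follows. This is a minimal sketch, not the plugin's code: `bootTimeoutMs` is a hypothetical helper, `ORCH_SIDECAR_BOOT_TIMEOUT_MS` is the variable name proposed above rather than an existing setting, and the 180 s default mirrors the suggested value.

```typescript
// Proposed default from recommendation 2 (assumption, not a current setting).
const DEFAULT_BOOT_TIMEOUT_MS = 180_000;

// Read the boot timeout from the environment, falling back to the default
// when the variable is unset, empty, non-numeric, or not positive.
function bootTimeoutMs(env: Record<string, string | undefined> = process.env): number {
  const raw = env.ORCH_SIDECAR_BOOT_TIMEOUT_MS;
  if (raw === undefined || raw.trim() === "") return DEFAULT_BOOT_TIMEOUT_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : DEFAULT_BOOT_TIMEOUT_MS;
}

console.log(bootTimeoutMs({ ORCH_SIDECAR_BOOT_TIMEOUT_MS: "300000" }));
```

Falling back silently on bad input keeps a typo in the variable from turning into a second kind of boot failure; logging the rejected value would be a reasonable addition.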

Happy to follow up with a PR if a maintainer signals interest. Thanks for the plugin — install completed on the second `install_embeddings` call and adoption is underway.
