bench: MCP-transport sibling of the code-graph track#684

Draft
DvirDukhan wants to merge 6 commits into
dvirdukhan/fix-find-symbol-nested-namefrom
dvirdukhan/bench-mcp-track

@DvirDukhan DvirDukhan commented May 27, 2026

bench: MCP-transport sibling of the code-graph track

Adds a second bench track that exercises code-graph through the
exact transport real-world agents use (Claude Code, Cursor, Cline…) —
JSON-RPC over stdio to a spawned cgraph-mcp server — instead of HTTP
to the FastAPI service.

Same backend graph, different transport. This lets us answer
"does the MCP integration regress accuracy or token usage vs the
direct HTTP path?"
on the same SWE-bench tasks.

What's in

Adapter & shim

  • bench/agents/code_graph_mcp_adapter.py — sync Python adapter that
    spawns cgraph-mcp per call via the official MCP Python SDK.
    Covers all 8 MCP tools.
  • bench/cli/cg-mcp + cg_mcp.py — bash-callable CLI shim mirroring
    the existing cg shim (mini-swe-agent only does bash, so each
    "tool" is one CLI invocation).
  • bench/tools/code_graph_mcp/{tools.yaml,system_preamble.md}:
    agent config; excludes ask per Q2 (no nested LLM in the
    benchmarked tool set).

Runner integration (second commit)

  • VALID_CONFIGS gains code_graph_mcp; --config code_graph_mcp
    is now a first-class track.
  • New INSTANCE_TEMPLATE_CODE_GRAPH_MCP: mirrors the HTTP template
    but tells the agent to call cg-mcp with $PROJECT_NAME +
    $BRANCH, and to use impact_analysis before non-trivial edits.
  • config_env("code_graph_mcp", ...) prepends the venv bin to
    PATH (so cgraph-mcp is callable from the agent's bash),
    passes FALKORDB_* through, and exports PROJECT_NAME /
    BRANCH.
  • _ensure_indexed_mcp() mirrors _ensure_indexed but goes through
    the MCP adapter. Skip-if-present probe hits FalkorDB's GRAPH.LIST
    directly (one trip, no MCP spawn).
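
The environment handling described for config_env can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the function name, its parameters, and the choice to copy only FALKORDB_* keys from the base environment are assumptions.

```python
import os


def config_env_sketch(base_env, venv_bin, project_name, branch):
    """Build the environment handed to the agent's bash (hypothetical helper)."""
    # Assumption: only FALKORDB_* variables are copied from the base environment.
    env = {k: v for k, v in base_env.items() if k.startswith("FALKORDB_")}
    # Prepend the venv bin directory so `cgraph-mcp` resolves first on PATH.
    path_parts = [venv_bin]
    if base_env.get("PATH"):
        path_parts.append(base_env["PATH"])
    env["PATH"] = os.pathsep.join(path_parts)
    # Exported so the instance template can reference $PROJECT_NAME and $BRANCH.
    env["PROJECT_NAME"] = project_name
    env["BRANCH"] = branch
    return env
```

Building a fresh dict instead of mutating os.environ keeps each benchmark instance isolated from the previous one.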

Tests

  • tests/bench/test_cg_mcp_adapter.py — 5 unit + 1 e2e. The e2e is
    double-gated on FalkorDB being reachable AND api.mcp.server being
    importable, so it skips cleanly on branches that pre-date the MCP
    stack.
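
A double gate of this kind might look like the sketch below, using only the standard library. The helper names, the default host and port (6379, assuming a stock Redis-compatible listener), and the timeout are assumptions; the real test presumably feeds an equivalent condition to a pytest skip marker.

```python
import importlib.util
import socket


def falkordb_reachable(host="localhost", port=6379, timeout=0.5):
    """Return True if a TCP connection to the given host/port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def module_importable(name):
    """Return True if `name` can be located (parents are imported, the target is not)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError, an ImportError subclass.
        return False


def should_skip_e2e(host="localhost", port=6379):
    """Skip unless both gates pass: FalkorDB is up AND the MCP server module exists."""
    return not (falkordb_reachable(host, port) and module_importable("api.mcp.server"))
```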

Validation

Heavy e2e validated locally against the api/ subgraph
(~6.3k nodes / ~6.2k edges) over real stdio:

search_code(prefix=index_repo)  -> id=96, api/cli.py:223
get_callers(96)                  -> [] (CLI entry, correct)
impact_analysis(96, depth=2)     -> [] (consistent)

Runner wiring smoke-verified:

VALID_CONFIGS = ('baseline','lsp','code_graph','code_graph_mcp')
load_instance_template('code_graph_mcp')  contains 'cg-mcp'
config_env populates PROJECT_NAME / BRANCH / FALKORDB_HOST

Stacking

Sits on dvirdukhan/fix-find-symbol-nested-name (the bench harness
PR). The end-to-end test needs the MCP server itself, which lands in
PRs #675–#683. Until both stacks merge to staging, the e2e
self-skips and the 5 unit tests cover the adapter logic.

Design notes

  • Per-call spawn. ~0.5–1s startup per cg-mcp … invocation.
    Keeps the bash shim trivially safe (one process per call, no
    shared state) and matches how real agents use MCP servers today.
    A future optimization could persist the server via a side-channel
    daemon if benchmarks show overhead matters.
  • Output shape. Collection-returning tools land their array in
    structuredContent.result per the MCP spec; _extract handles
    both that and the text-chunk JSON shape so the bash agent always
    sees a single clean JSON document.
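
A normalizer along the lines of _extract could be sketched as below. It assumes the tool result has already been converted to a plain dict with the MCP field names content and structuredContent; the function name and the fallback to raw text are illustrative choices, not the adapter's confirmed behavior.

```python
import json


def extract_sketch(result):
    """Reduce an MCP tool result (as a plain dict) to one JSON-serializable value."""
    structured = result.get("structuredContent")
    # Collection-returning tools: the array sits under structuredContent.result.
    if isinstance(structured, dict) and "result" in structured:
        return structured["result"]
    if structured is not None:
        return structured
    # Text-chunk shape: join the text blocks and parse them as JSON when possible.
    texts = [
        chunk.get("text", "")
        for chunk in result.get("content", [])
        if chunk.get("type") == "text"
    ]
    joined = "".join(texts)
    try:
        return json.loads(joined)
    except json.JSONDecodeError:
        return joined  # not JSON; hand back the raw text
```

Either way the shim can print the return value with json.dumps, so the bash agent sees one document per call.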

Adds a second bench track that exercises code-graph through the
exact transport real-world agents use (Claude Code, Cursor, …) —
JSON-RPC over stdio to a spawned `cgraph-mcp` server — instead of
HTTP to the FastAPI service.

Files:

- bench/agents/code_graph_mcp_adapter.py — sync Python adapter that
  spawns cgraph-mcp per call via the official MCP Python SDK. Knows
  the 8-tool MCP surface (search_code, get_callers, get_callees,
  get_dependencies, impact_analysis, find_path, index_repo, ask).
- bench/cli/cg-mcp + cg_mcp.py — bash-callable CLI shim mirroring
  the existing `cg` shim. mini-swe-agent only does bash, so each
  "tool" is one CLI invocation.
- bench/tools/code_graph_mcp/{tools.yaml,system_preamble.md} —
  agent config for the MCP track. Mirrors code_graph; same Q2
  decision to exclude `ask` (no nested LLM in the benchmarked tool
  set).
- tests/bench/test_cg_mcp_adapter.py — 5 unit + 1 e2e test
  (FalkorDB-gated AND MCP-server-gated so it skips cleanly until
  the MCP stack lands on staging).

Heavy e2e validated against the api/ subgraph (~6.3k nodes) over
real stdio: search_code -> get_callers -> impact_analysis returned
expected payloads.

Depends on the MCP stack (PRs #675–#683) for cgraph-mcp itself.
Lands cleanly once that stack merges.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

DvirDukhan and others added 2 commits May 27, 2026 14:51
The adapter and shim from the previous commit were inert from the
runner's perspective — VALID_CONFIGS only knew baseline/lsp/code_graph.
This commit makes `--config code_graph_mcp` a first-class track.

Changes in bench/runners/mini_runner.py:

- VALID_CONFIGS gains "code_graph_mcp" (passes argparse + help string).
- New INSTANCE_TEMPLATE_CODE_GRAPH_MCP: mirrors the HTTP code_graph
  template but tells the agent to call `cg-mcp` with $PROJECT_NAME +
  $BRANCH, and to use impact_analysis before non-trivial edits.
- load_instance_template dispatches the new template.
- config_env("code_graph_mcp", ...) prepends venv bin to PATH (so
  cgraph-mcp is callable from the agent's bash), passes FALKORDB_*
  through to the spawned MCP server, and exports PROJECT_NAME +
  BRANCH which the preamble references.
- New _ensure_indexed_mcp() mirrors _ensure_indexed but goes through
  the bench MCP adapter instead of HTTP. Skip-if-present probe hits
  FalkorDB's GRAPH.LIST directly (one trip, no MCP spawn).
- Per-instance loop now dispatches to _ensure_indexed_mcp for the
  new config.

Smoke-verified that:
- VALID_CONFIGS == ('baseline','lsp','code_graph','code_graph_mcp')
- load_instance_template('code_graph_mcp') contains 'cg-mcp'
- config_env populates PROJECT_NAME/BRANCH/FALKORDB_HOST

Unit tests for the adapter still pass (5 passed, 1 skipped — heavy
e2e double-gated on FalkorDB + api.mcp.server availability).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The mini_runner previously printed a [index] WARN line on
analyze_folder errors and continued. This meant SWE-bench instances
whose path falls outside ALLOWED_ANALYSIS_DIR (e.g. when the API
server is started from a sibling worktree) would silently run the
agent against a missing code-graph project. The agent's first cg
call returns 400 'Missing project ...', the agent falls back to
grep/sed, and we get a token count that looks bad for the
code_graph track but actually reflects 'tool unavailable'.

Two changes:

* analyze_folder errors and httpx exceptions now raise RuntimeError
  with the offending path. This stops the run and surfaces the
  ALLOWED_ANALYSIS_DIR misconfiguration immediately.
* analyze_folder timeout bumped 600s -> 1800s. The 600s default
  was tight for sympy (~5 MB of Python, ~5000 functions) and
  caused a timeout during indexing.
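
The fail-fast behavior can be illustrated with a small sketch. The post callable is a hypothetical stand-in for the runner's httpx request and returns (status_code, body); the function name and message wording are also assumptions.

```python
def ensure_indexed_sketch(post, path, timeout=1800.0):
    """Index `path` or raise; never continue with a missing project."""
    try:
        status, body = post(path, timeout)
    except Exception as exc:  # stands in for httpx transport errors and timeouts
        raise RuntimeError(f"analyze_folder failed for {path}: {exc}") from exc
    if status >= 400:
        # Include the offending path so a misconfigured ALLOWED_ANALYSIS_DIR is obvious.
        raise RuntimeError(f"analyze_folder returned {status} for {path}: {body}")
    return body
```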

This was discovered while running the first real 3-way SWE-bench
smoke. With the fix and a corrected ALLOWED_ANALYSIS_DIR, the
code_graph track produces sensible numbers (-11% input vs baseline
across the smoke sample vs the prior bogus +4.7%).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dvirdukhan and others added 2 commits May 27, 2026 17:24
The Python analyzer hardcoded `environment_path={path}/venv` when starting
jedi-language-server via multilspy. When the repo had no venv (the common
case for cloned codebases like sphinx, sympy, anything from SWE-bench),
jedi raised `InvalidPythonEnvironment` on every `request_definition()`
call. analyzer.resolve() then swallowed the exception silently and the
indexer produced a graph with DEFINES edges only — zero CALLS, zero
EXTENDS. Benchmark validation showed sphinx (5K functions) and sympy
(41K functions) had no resolved cross-references at all.

Fix:
- source_analyzer.py: prefer {repo}/venv, then {repo}/.venv, then fall
  back to the host interpreter's environment (sys.executable's prefix)
  so jedi always has a valid Python to introspect.
- analyzer.py: log resolve() failures at WARN with file/line context
  instead of swallowing them silently, so the next regression is loud.
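
The lookup order in the fix can be sketched as follows. The function name is hypothetical, and using sys.prefix for "the host interpreter's environment" is an assumption about how the fallback is implemented.

```python
import sys
from pathlib import Path


def pick_environment_path(repo):
    """Return the Python environment to hand to jedi for `repo`."""
    repo = Path(repo)
    # Prefer an environment that lives inside the repository.
    for name in ("venv", ".venv"):
        candidate = repo / name
        if candidate.is_dir():
            return str(candidate)
    # Fall back to the interpreter running the indexer, which always exists.
    return sys.prefix
```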

Verified: re-indexed sphinx-doc/sphinx-9230 with the fix:
  DEFINES: 5640, CALLS: 4931, EXTENDS: 484 (was DEFINES-only).

Fixes #685.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two production-quality fixes from the calibration run that crashed at
14/30 trajectories:

1. Resume support: skip (instance, cfg) pairs whose trajectory file
   already exists. Lets us recover from crashes/kills without re-running
   completed work (avoids ~$3 of wasted compute on this run).
2. Ignore pathological files at index time: sympy/integrals/rubi/rules
   contains auto-generated 3000-line files with hundreds of unresolvable
   symbols per line. jedi spends hours and never makes progress. Adding
   it to the default ignore list unblocks sympy-19040 (and other sympy
   instances) without affecting graph quality.

Also expanded default ignore set: __pycache__, build, dist, .tox, .eggs.
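
Both changes are simple enough to sketch. The trajectory file layout, the helper names, and the matching rule (an ignore entry must match whole path components) are assumptions for illustration; the ignore entries are the ones named in this commit message.

```python
from pathlib import Path

DEFAULT_IGNORE = (
    "__pycache__", "build", "dist", ".tox", ".eggs",
    "sympy/integrals/rubi/rules",
)


def pending_pairs(instances, configs, out_dir):
    """Yield (instance, cfg) pairs that have no trajectory file yet."""
    out_dir = Path(out_dir)
    for instance in instances:
        for cfg in configs:
            traj = out_dir / cfg / f"{instance}.traj.json"  # hypothetical layout
            if traj.exists():
                continue  # finished in an earlier run; skip on resume
            yield instance, cfg


def is_ignored(rel_path, ignore=DEFAULT_IGNORE):
    """True if any ignore entry matches whole components of a POSIX-style path."""
    padded = "/" + rel_path.strip("/") + "/"
    return any("/" + entry + "/" in padded for entry in ignore)
```

Matching on whole components keeps a directory such as build_tools from being caught by the build entry.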

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@DvirDukhan

Calibration complete (n=10 SWE-bench Verified, Sonnet 4.5)

After landing the analyzer fix (PR #686, KeyError defensive patch) and bench resume support:

| config | sum tokens | Δ vs baseline | mean tool calls | mean wall (s) |
|---|---|---|---|---|
| baseline | 10,846,303 | | 56.7 | 324.6 |
| lsp | 8,346,479 | −23.0% | 50.4 | 293.8 |
| code_graph | 9,006,634 | −17.0% | 47.9 | 267.6 |

Median per-task: code_graph −22.5%, lsp −22.2%.

Resolve rate: 0/10 across all configs — expected for Sonnet at step-limit 75 on SWE-bench Verified. We're measuring token cost; accuracy needs Opus + higher step limit (next phase).

Per-instance code_graph vs baseline

| instance | baseline | code_graph | Δ |
|---|---|---|---|
| django-14771 | 1,178,608 | 669,425 | −43% |
| django-16661 | 1,071,494 | 1,093,370 | +2% |
| pydata-xarray-4629 | 1,097,868 | 223,976 | −80% |
| pytest-6202 | 1,080,968 | 362,228 | −67% |
| scikit-learn-10844 | 300,805 | 248,138 | −18% |
| sphinx-8035 | 1,339,158 | 1,635,271 | +22% |
| sphinx-9230 | 1,177,778 | 1,273,885 | +8% |
| sympy-12481 | 1,434,559 | 1,665,642 | +16% |
| sympy-19040 | 1,495,339 | 1,592,179 | +7% |
| sympy-20154 | 669,726 | 242,520 | −64% |
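
For anyone re-deriving the numbers, the sketch below recomputes the token sums and the aggregate Δ for code_graph from the per-instance counts above. The variable and function names are illustrative and not taken from the bench code.

```python
# (instance, baseline tokens, code_graph tokens), copied from the table above
ROWS = [
    ("django-14771", 1_178_608, 669_425),
    ("django-16661", 1_071_494, 1_093_370),
    ("pydata-xarray-4629", 1_097_868, 223_976),
    ("pytest-6202", 1_080_968, 362_228),
    ("scikit-learn-10844", 300_805, 248_138),
    ("sphinx-8035", 1_339_158, 1_635_271),
    ("sphinx-9230", 1_177_778, 1_273_885),
    ("sympy-12481", 1_434_559, 1_665_642),
    ("sympy-19040", 1_495_339, 1_592_179),
    ("sympy-20154", 669_726, 242_520),
]


def pct_delta(new, old):
    """Relative change of `new` against `old`, in percent."""
    return 100.0 * (new - old) / old


total_baseline = sum(base for _, base, _ in ROWS)
total_code_graph = sum(cg for _, _, cg in ROWS)
aggregate_delta = pct_delta(total_code_graph, total_baseline)

for name, base, cg in ROWS:
    print(f"{name:20s} {pct_delta(cg, base):+6.1f}%")
print(f"{'sum':20s} {aggregate_delta:+6.1f}%")
```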

Conclusions:

  1. Token savings are real and reproducible (~22% median across both tool tracks).
  2. Variance is enormous (−80% to +22%) — n=10 is the minimum for a signal but not enough for tight CIs. Next: n=30.
  3. code_graph wins big on focused-API tasks (xarray, pytest, sympy-20154) and regresses when the graph surfaces many irrelevant nodes (sphinx, sympy-12481/19040).
  4. The new bottleneck is code_graph indexing wall-time: sympy-12481 took ~50 min. Filed #687 ("Index second_pass: parallelize per-file LSP symbol resolution to cut wall-time on large repos") to parallelize second_pass.

Operational fixes that landed this run

  • source_analyzer.py venv-path fallback + WARN logging (PR #686, "fix(analyzer): resolve LSP CALLS edges on repos without a venv (#685)")
  • source_analyzer.py defensive self.files.get() in second_pass (KeyError on sympy-12481/distributedmodules.py)
  • mini_runner.py resume support (trajectory file = skip)
  • mini_runner.py default ignore list incl. rubi/rules
  • mini_runner.py httpx timeout 1800 → 7200 for slow indexes

Next

In source_analyzer.second_pass, the list of files we iterate can include
paths that first_pass did not add to self.files (e.g. parse errors,
LSP-induced timeouts, or rare edge cases where a candidate file is
present in the input list but never makes it into the files map).
Previously this raised KeyError and aborted the entire index. Hit on
sympy/polys/distributedmodules.py during bench calibration of sympy-12481.

Skip with a WARN log instead so a single bad file no longer takes down
the whole index.
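
The defensive lookup amounts to something like the sketch below. The function signature, the logger name, and the returned list are assumptions made so the example is self-contained; the real second_pass does symbol resolution where this sketch only records the path.

```python
import logging

logger = logging.getLogger("second_pass_sketch")


def second_pass_sketch(candidate_paths, files):
    """Visit only paths that first_pass registered in `files`; warn on the rest."""
    processed = []
    for path in candidate_paths:
        entry = files.get(path)  # .get() instead of files[path]: no KeyError
        if entry is None:
            logger.warning("second_pass: %s missing from files map, skipping", path)
            continue
        processed.append(path)  # the real code would resolve symbols for `entry`
    return processed
```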

Also bump mini_runner httpx timeout 1800s -> 7200s. The sympy-12481
index was observed taking >30 min in the field; with the old timeout
the API server finished indexing successfully, but the runner had
already given up.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>